HLT farm crash in run 379617 #44769

Open
jalimena opened this issue Apr 18, 2024 · 23 comments
@jalimena

jalimena commented Apr 18, 2024

Reporting the crashes in run 379617

Seems to crash on CPU and GPU

To reproduce:

cmsrel CMSSW_14_0_5_patch1
cd CMSSW_14_0_5_patch1/src
cmsenv
#!/bin/bash -ex

# CMSSW_14_0_5_patch1

wget https://raw.githubusercontent.com/mmusich/hltScripts/master/fog/convertFromRawToEDM.py 
cmsRun convertFromRawToEDM.py /eos/cms/store/group/tsg/FOG/error_stream/run379617/run379617_ls0004_index000248_fu-c2b02-19-01_pid1561106.raw converted.root

hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root  > hlt.py
  
cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

This triggers the following assertion:

cmsRun: src/RecoLocalTracker/SiPixelRecHits/plugins/alpaka/PixelRecHits.h:179: void alpaka_serial_sync::pixelRecHits::GetHits<TrackerTraits>::operator()(const TAcc&, const pixelCPEforDevice::ParamsOnDeviceT<TrackerTraits>*, const BeamSpotPOD*, SiPixelDigisSoAConstView, uint32_t, uint32_t, SiPixelClustersSoAConstView, TrackingRecHitSoAView<TrackerTraits>) const [with TAcc = alpaka::AccCpuSerial<std::integral_constant<long unsigned int, 1>, unsigned int>; <template-parameter-2-2> = void; TrackerTraits = pixelTopology::Phase1; SiPixelDigisSoAConstView = SiPixelDigisLayout<>::ConstViewTemplateFreeParams<128, false, true, false>; uint32_t = unsigned int; SiPixelClustersSoAConstView = SiPixelClustersLayout<>::ConstViewTemplateFreeParams<128, false, true, false>; TrackingRecHitSoAView<TrackerTraits> = TrackingRecHitSoA<pixelTopology::Phase1>::Layout<>::ViewTemplateFreeParams<128, false, true, false>]: Assertion `h < (uint32_t)hits.metadata().size()' failed.


A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI

@cmsbuild

cmsbuild commented Apr 18, 2024

cms-bot internal usage

@cmsbuild

A new Issue was created by @jalimena.

@Dr15Jones, @antoniovilela, @rappoccio, @makortel, @smuzaffar, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich

mmusich commented Apr 18, 2024

assign hlt, heterogeneous

@mmusich

mmusich commented Apr 18, 2024

@cms-sw/tracking-pog-l2 @cms-sw/trk-dpg-l2 @AdrianoDee FYI

@cmsbuild

New categories assigned: hlt,heterogeneous

@Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich

mmusich commented Apr 18, 2024

For the record, running the same script as above on lxplus-gpu results in:

%MSG-i AlpakaService:  (NoModuleName) 18-Apr-2024 10:55:09 CEST pre-events
AlpakaServiceSerialSync succesfully initialised.
Found 1 device:
  - Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
%MSG
%MSG-i CUDAService:  (NoModuleName) 18-Apr-2024 10:55:10 CEST pre-events
CUDA runtime version 12.2, driver version 12.4, NVIDIA driver version 550.54.15
CUDA device 0: Tesla T4 (sm_75)
%MSG
%MSG-i AlpakaService:  (NoModuleName) 18-Apr-2024 10:55:10 CEST pre-events
AlpakaServiceCudaAsync succesfully initialised.
Found 1 device:
  - Tesla T4
%MSG
2024-04-18 10:55:19.687525: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different com
putation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
18-Apr-2024 10:55:22 CEST  Initiating request to open file file:converted.root
18-Apr-2024 10:55:22 CEST  Successfully opened file file:converted.root
2024-04-18 10:55:31.554180: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-18 10:55:31.585301: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
#--------------------------------------------------------------------------
#                         FastJet release 3.4.1
#                 M. Cacciari, G.P. Salam and G. Soyez                  
#     A software package for jet finding and analysis at colliders      
#                           http://fastjet.fr                           
#	                                                                      
# Please cite EPJC72(2012)1896 [arXiv:1111.6097] if you use this package
# for scientific work and optionally PLB641(2006)57 [hep-ph/0512210].   
#                                                                       
# FastJet is provided without warranty under the GNU GPL v2 or higher.  
# It uses T. Chan's closest pair algorithm, S. Fortune's Voronoi code
# and 3rd party plugin jet algorithms. See COPYING file for details.
#--------------------------------------------------------------------------
%MSG-w InvalidInputTag:  CSCHaloDataProducer:hltCSCHaloData  18-Apr-2024 10:56:47 CEST Run: 379617 Event: 5531783
 The Cosmic Muon collection does not appear to be in the event. These beam halo  identification variables will be empty
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:hltEgammaGsfTracksUnseeded  18-Apr-2024 10:57:13 CEST Run: 379617 Event: 5535364
KF updated state 0 is invalid. skipping.
%MSG
%MSG-e InvalidState:  GsfTrackProducer:hltEgammaGsfTracksUnseeded  18-Apr-2024 10:57:13 CEST Run: 379617 Event: 5535364
first hit
%MSG
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_5_patch1-el9_amd64_gcc12/build/CMSSW_14_0_5_patch1-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorAssert': 'device-side assert triggered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_5_patch1-el9_amd64_gcc12/build/CMSSW_14_0_5_patch1-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorAssert': 'device-side assert triggered'!
terminate called after throwing an instance of 'std::runtime_error'
  what():  /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_5_patch1-el9_amd64_gcc12/build/CMSSW_14_0_5_patch1-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(160) 'TApi::eventRecord(event.getNativeHandle(), queue.getNativeHandle())' returned error  : 'cudaErrorAssert': 'device-side assert triggered'!


A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

Thu Apr 18 10:57:15 CEST 2024
Thread 9 (Thread 0x7f5fe5353640 (LWP 1942464) "cmsRun"):
#0  0x00007f60b4313975 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007f60b4318527 in nanosleep () from /lib64/libc.so.6
#2  0x00007f60b431845e in sleep () from /lib64/libc.so.6
#3  0x00007f60acc9dd40 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f601cfd9fc8 in alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 2ul>, unsigned int, alpaka_serial_sync::FillRhfIndex, reco::PFRecHitSoALayout<128ul, false>::ConstViewTemplateF
reeParams<128ul, false, true, false> const&, reco::PFClusteringVarsSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&, reco::PFRecHitFractionSoALayout<128ul, false>::ViewTemplate
FreeParams<128ul, false, true, false>&>::operator()() const () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPlug
insPortableSerialSync.so
#6  0x00007f601cfdfd8b in alpaka_serial_sync::PFClusterProducerKernel::execute(alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&, PortableHostCollection<reco::PFClusterParamsSoALayout<128ul, false> > 
const&, PortableHostCollection<reco::PFRecHitHCALTopologySoALayout<128ul, false> > const&, PortableHostCollection<reco::PFClusteringVarsSoALayout<128ul, false> >&, PortableHostCollection<reco::PFClusteri
ngEdgeVarsSoALayout<128ul, false> >&, PortableHostCollection<reco::PFRecHitSoALayout<128ul, false> > const&, PortableHostCollection<reco::PFClusterSoALayout<128ul, false> >&, PortableHostCollection<reco:
:PFRecHitFractionSoALayout<128ul, false> >&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSeri
alSync.so
#7  0x00007f601cfd708e in alpaka_serial_sync::PFClusterSoAProducer::produce(alpaka_serial_sync::device::Event&, alpaka_serial_sync::device::EventSetup const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/c
ms/cmssw-patch/CMSSW_14_0_5_patch1/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#8  0x00007f601cfd5215 in alpaka_serial_sync::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/lib/el9_amd
64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#9  0x00007f60b66483c1 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12
/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/libFWCoreFramework.so
#10 0x00007f60b662c04e in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/
CMSSW_14_0_5/lib/el9_amd64_gcc12/libFWCoreFramework.so
#11 0x00007f60b65b9159 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::excepti
on_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchA
ctionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/libFWCoreFramework.so
#12 0x00007f60b65b96c4 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/li
b/el9_amd64_gcc12/libFWCoreFramework.so
#13 0x00007f60b6746f28 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CM
SSW_14_0_5/lib/el9_amd64_gcc12/libFWCoreConcurrency.so
#14 0x00007f60b571a91b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7f6058783500, waiter=..., this=0x7f60b29bbe80) at /data/cmsbld/jenkins
/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7f60b29bbe80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_
amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#16 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/
v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/arena.cpp:137
#17 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v
2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/market.cpp:599
#18 0x00007f60b571cace in tbb::detail::r1::rml::private_worker::run (this=0x7f60afc07100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd
64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#19 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f60afc07100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/ext
ernal/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#20 0x00007f60b429f802 in start_thread () from /lib64/libc.so.6
#21 0x00007f60b423f450 in clone3 () from /lib64/libc.so.6
Thread 8 (Thread 0x7f600cc2b640 (LWP 1942463) "cmsRun"):
#0  0x00007f60b429c39a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x00007f60b42a7838 in __new_sem_wait_slow64.constprop.0 () from /lib64/libc.so.6
#2  0x00007f609aa53dda in ?? () from /lib64/libcuda.so.1
#3  0x00007f609aa63373 in ?? () from /lib64/libcuda.so.1
#4  0x00007f60b429f802 in start_thread () from /lib64/libc.so.6
#5  0x00007f60b423f450 in clone3 () from /lib64/libc.so.6
Thread 7 (Thread 0x7f60257ff640 (LWP 1942389) "cmsRun"):
#0  0x00007f60b429c39a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x00007f60b429eba0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2  0x00007f604a77787e in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WaitForWork(Eigen::EventCount::Waiter*, tsl::thread::EigenEnvironment::Task*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/c
ms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/lib/libtensorflow_cc.so.2
#3  0x00007f604a777de3 in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/li
b/libtensorflow_cc.so.2
#4  0x00007f604a7755f8 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /cvmfs/cms.cern.ch/e
l9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/lib/libtensorflow_cc.so.2
#5  0x00007f603d482422 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/lib/libtensorflow_fr
amework.so.2
#6  0x00007f60b429f802 in start_thread () from /lib64/libc.so.6
#7  0x00007f60b423f450 in clone3 () from /lib64/libc.so.6
Thread 6 (Thread 0x7f602a7ff640 (LWP 1942388) "cmsRun"):
#0  0x00007f60b429c39a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x00007f60b429eba0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2  0x00007f604a77787e in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WaitForWork(Eigen::EventCount::Waiter*, tsl::thread::EigenEnvironment::Task*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/c
ms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/lib/libtensorflow_cc.so.2
#3  0x00007f604a777de3 in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/li
b/libtensorflow_cc.so.2
#4  0x00007f604a7755f8 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /cvmfs/cms.cern.ch/e
l9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/lib/libtensorflow_cc.so.2
#5  0x00007f603d482422 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/lib/libtensorflow_fr
amework.so.2
#6  0x00007f60b429f802 in start_thread () from /lib64/libc.so.6
#7  0x00007f60b423f450 in clone3 () from /lib64/libc.so.6
Thread 5 (Thread 0x7f602b32a640 (LWP 1942387) "cmsRun"):
#0  0x00007f60b429c39a in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x00007f60b429eba0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2  0x00007f604a77787e in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WaitForWork(Eigen::EventCount::Waiter*, tsl::thread::EigenEnvironment::Task*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/c
ms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/lib/libtensorflow_cc.so.2
#3  0x00007f604a777de3 in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/li
b/libtensorflow_cc.so.2
#4  0x00007f604a7755f8 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /cvmfs/cms.cern.ch/e
l9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/lib/libtensorflow_cc.so.2
#5  0x00007f603d482422 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/external/el9_amd64_gcc12/lib/libtensorflow_fr
amework.so.2
#6  0x00007f60b429f802 in start_thread () from /lib64/libc.so.6
#7  0x00007f60b423f450 in clone3 () from /lib64/libc.so.6
Thread 4 (Thread 0x7f60735de640 (LWP 1942342) "cuda-EvtHandlr"):
#0  0x00007f60b434291f in poll () from /lib64/libc.so.6
#1  0x00007f609aa6889f in ?? () from /lib64/libcuda.so.1
#2  0x00007f609ab36dcf in ?? () from /lib64/libcuda.so.1
#3  0x00007f609aa63373 in ?? () from /lib64/libcuda.so.1
#4  0x00007f60b429f802 in start_thread () from /lib64/libc.so.6
#5  0x00007f60b423f450 in clone3 () from /lib64/libc.so.6
Thread 3 (Thread 0x7f607b25f640 (LWP 1942328) "cuda0000340000e"):
#0  0x00007f60b434291f in poll () from /lib64/libc.so.6
#1  0x00007f609aa6889f in ?? () from /lib64/libcuda.so.1
#2  0x00007f609ab36dcf in ?? () from /lib64/libcuda.so.1
#3  0x00007f609aa63373 in ?? () from /lib64/libcuda.so.1
#4  0x00007f60b429f802 in start_thread () from /lib64/libc.so.6
#5  0x00007f60b423f450 in clone3 () from /lib64/libc.so.6
Thread 2 (Thread 0x7f607bb05640 (LWP 1942321) "cmsRun"):
#0  0x00007f60b431830f in wait4 () from /lib64/libc.so.6
#1  0x00007f60acc9de97 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f60acca0d6a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f60b46d84d3 in std::execute_native_thread_routine (__p=0x7f60ad192670) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#4  0x00007f60b429f802 in start_thread () from /lib64/libc.so.6
#5  0x00007f60b423f450 in clone3 () from /lib64/libc.so.6
Thread 1 (Thread 0x7f60b50e7640 (LWP 1942239) "cmsRun"):
#0  0x00007f60b434291f in poll () from /lib64/libc.so.6
#1  0x00007f60accec62f in full_read.constprop () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f60acca0e3c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f60acca17a0 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f60b42a154c in __pthread_kill_implementation () from /lib64/libc.so.6
#6  0x00007f60b4254d06 in raise () from /lib64/libc.so.6
#7  0x00007f60b42287f3 in abort () from /lib64/libc.so.6
#8  0x00007f60b46a1a49 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#9  0x00007f60b46ace6a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#10 0x00007f60b46abed9 in __cxa_call_terminate (ue_header=0x7f5fd4a6f800) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#11 0x00007f60b46ac5f6 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=<optimized out>, context=0x7ffd862fb2c0) at ../../../../libs
tdc++-v3/libsupc++/eh_personality.cc:688
#12 0x00007f60b515c864 in _Unwind_RaiseException_Phase2 (exc=0x7f5fd4a6f800, context=0x7ffd862fb2c0, frames_p=0x7ffd862fb1c8) at ../../../libgcc/unwind.inc:64
#13 0x00007f60b515d2bd in _Unwind_Resume (exc=0x7f5fd4a6f800) at ../../../libgcc/unwind.inc:242
#14 0x00007f6021041cce in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> >::free(void*) [clone .cold] () from /cvmfs/
cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/lib/el9_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableCudaAsync.so
#15 0x00007f602105b638 in std::_Sp_counted_ptr_inplace<alpaka::detail::BufCpuImpl<std::byte, std::integral_constant<unsigned long, 1ul>, unsigned int>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::
_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/lib/el9_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableCudaAsync.so
#16 0x00007f602104aa07 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/lib/el9_amd64_gcc12/pluginRecoTrac
kerPixelSeedingPortableCudaAsync.so
#17 0x00007f602105863c in std::_Sp_counted_ptr_inplace<std::tuple<TracksHost<pixelTopology::Phase1>, std::shared_ptr<alpaka_cuda_async::EDMetadata> >, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_
M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/lib/el9_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableCudaAsync.so
#18 0x00007f602104aa07 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/lib/el9_amd64_gcc12/pluginRecoTrac
kerPixelSeedingPortableCudaAsync.so
#19 0x00007f602105a49b in std::any::_Manager_external<std::shared_ptr<std::tuple<TracksHost<pixelTopology::Phase1>, std::shared_ptr<alpaka_cuda_async::EDMetadata> > > >::_S_manage(std::any::_Op, std::any
 const*, std::any::_Arg*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/lib/el9_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableCudaAsync.so
#20 0x00007f60b66186ca in std::_Sp_counted_ptr_inplace<std::any, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_
amd64_gcc12/libFWCoreFramework.so
#21 0x00007f60b652a1a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/libFWCoreFramework.so
#22 0x00007f60b661896f in edm::FunctorWaitingTask<edm::TransformerBase::transformImpAsync(edm::WaitingTaskHolder, unsigned long, edm::ActivityRegistry*, edm::ProducerBase const&, edm::EventForTransformer
&) const::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::~FunctorWaitingTask() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/libFWCoreFramework.so
#23 0x00007f60b6747232 in tbb::detail::d1::function_task<edm::WaitingTaskWithArenaHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}::operator()() const::{lambda()#1}>::execute(tbb::d
etail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/libFWCoreConcurrency.so
#24 0x00007f60b5723241 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f60b29bbe00) at /data/cmsbld/jenkins/worksp
ace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#25 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f60b29bbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-
el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#26 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUI
LD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#27 0x00007f60b653da6b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/libFWCoreFramework.so
#28 0x00007f60b65471ea in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/libFWCoreFramework.so
#29 0x00007f60b6547741 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_5/lib/el9_amd64_gcc12/libFWCoreFramework.so
#30 0x00000000004074f5 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#31 0x00007f60b570f96d in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc
12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/arena.cpp:688
#32 0x0000000000408ee2 in main::{lambda()#1}::operator()() const ()
#33 0x000000000040517c in main ()

Current Modules:

Module: none (crashed)
Module: alpaka_serial_sync::PFClusterSoAProducer:hltParticleFlowClusterHBHESoASerialSync

A fatal system signal has occurred: abort signal

@Dr15Jones

From the stack trace, it looks to me like an exception happened while the system was handling another exception. When that happens, the C++ runtime aborts the job.

@makortel

From the stack trace, it looks to me like an exception happened while the system was handling another exception. When that happens, the C++ runtime aborts the job.

"The other exception" is removed by #44730 (master) / #44763 (14_0_X)

@makortel

The assertion referred to in the issue description is
https://github.com/cms-sw/cmssw/blob/master/RecoLocalTracker/SiPixelRecHits/plugins/alpaka/PixelRecHits.h#L179

Does this condition mean there are more hits than the allocated size of the output buffer? @AdrianoDee

@makortel

assign reconstruction

FYI @cms-sw/trk-dpg-l2

@cmsbuild

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich

mmusich commented Apr 18, 2024

type trk

cmsbuild added the trk label Apr 18, 2024

@AdrianoDee

AdrianoDee commented Apr 18, 2024

The assertion referred to in the issue description is https://github.com/cms-sw/cmssw/blob/master/RecoLocalTracker/SiPixelRecHits/plugins/alpaka/PixelRecHits.h#L179

Does this condition mean there are more hits than the allocated size of the output buffer? @AdrianoDee

Sort of, but not entirely, Matti. The culprit is

constexpr auto MAX_HITS = TrackerTraits::maxNumberOfHits;
for (uint32_t i : cms::alpakatools::independent_group_elements(acc, numberOfModules + 1)) {
  if (clus_view[i].clusModuleStart() > MAX_HITS)
    clus_view[i].clusModuleStart() = MAX_HITS;
}

where we cut the number of hits to "only":

static constexpr uint32_t maxNumberOfHits = 48 * 1024;

And the event of the crash has:

SiPixelClusterizerAlpaka results:
 > no. of digis: 301020
 > no. of active modules: 1794
 > no. of clusters: 50684
 > bpix2 offset: 19595

The poor man's solution would be to raise the max to something safer, such as static constexpr uint32_t maxNumberOfHits = 96 * 1024; but that seems to me a waste, also because this procedure is anyway a remnant of the intermediate phase of the Alpaka port when we had no runtime-sized histograms (before #43064).

So I have a proposal for a solution here that involves a slightly bigger code refactoring in order to drop the fixed number of hits. That branch is for CMSSW_14_0_5_patch1, and with it the failing job runs smoothly.
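
To make the numbers above concrete, here is a minimal Python sketch (not CMSSW code) of the clamp and of the resulting overflow. Only MAX_HITS, the 96 * 1024 alternative and the cluster count of 50684 come from this thread; the per-module offsets in toy_starts are hypothetical, chosen only so that the last one equals the reported number of clusters.

```python
# Values quoted in the discussion above.
MAX_HITS = 48 * 1024           # TrackerTraits::maxNumberOfHits
PROPOSED_MAX_HITS = 96 * 1024  # the "poor man's" alternative
N_CLUSTERS = 50684             # from the SiPixelClusterizerAlpaka printout


def clamp_module_starts(starts, max_hits):
    """Mimic the quoted loop: cap every cumulative offset at max_hits."""
    return [min(start, max_hits) for start in starts]


# Hypothetical cumulative cluster offsets per module (assumption for
# illustration); the last entry is the total number of clusters.
toy_starts = [0, 19595, 40000, 49000, N_CLUSTERS]

clamped = clamp_module_starts(toy_starts, MAX_HITS)
overflow = N_CLUSTERS - MAX_HITS

print(clamped[-1])  # 49152: the total is cut at MAX_HITS
print(overflow)     # 1532: clusters that no longer fit
print(clamp_module_starts(toy_starts, PROPOSED_MAX_HITS) == toy_starts)  # True
```

With the current limit, 1532 of the clusters in this event fall beyond the cap, while a limit of 96 * 1024 would leave the offsets untouched. How exactly the capped offsets lead to the failed assertion is not modelled here.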

@mmusich

mmusich commented Apr 18, 2024

@mmusich

mmusich commented Apr 19, 2024

I tested CMSSW_14_0_5_patch2 (which contains the fix PR #44774) with the following setup.

I repacked all the error stream files in /eos/cms/store/group/tsg/FOG/error_stream/run379617/ and put them in /eos/cms/store/group/tsg/FOG/debug/240417_run379617 with the instructions at #44769 (comment).

Then I tested with:

#!/bin/bash -ex
# CMSSW_14_0_5_patch2
hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root  > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

for inputfile in $(eos ls  /eos/cms/store/group/tsg/FOG/debug/240417_run379617/ | grep '\.root$'); do
    outputfile="${inputfile%.root}"
    cp hlt.py hlt_toRun.py
    sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379617\/${inputfile}/g" hlt_toRun.py
    cmsRun hlt_toRun.py &> "${outputfile}.log"
done

on both CPU (lxplus) and GPU (lxplus-gpu), and have seen neither crashes nor stuck jobs (as originally reported by the online operations team) so far, so the fix seems to be effective.
On the other hand, the nature of these stuck jobs is not understood at the moment, and TSG would like to perform deeper checks of both physics and timing performance with the new patch release before moving it online (coupled with the new HLT menu V1.1).
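
As an aside, the per-file logs written by the loop above can be checked for crashes with a small helper. The following is a minimal Python sketch, not part of any CMSSW tooling: the function names are hypothetical, and the signature strings are simply copied from the logs quoted earlier in this thread.

```python
from pathlib import Path

# Signatures copied from the crash logs quoted earlier in this thread.
CRASH_SIGNATURES = (
    "A fatal system signal has occurred",
    "Assertion `",
    "device-side assert triggered",
)


def find_crash_signatures(text):
    """Return the known crash signatures that occur in the given log text."""
    return [sig for sig in CRASH_SIGNATURES if sig in text]


def scan_logs(directory="."):
    """Map each *.log file in `directory` to the signatures found in it."""
    report = {}
    for log in sorted(Path(directory).glob("*.log")):
        hits = find_crash_signatures(log.read_text(errors="replace"))
        if hits:
            report[log.name] = hits
    return report


if __name__ == "__main__":
    for name, hits in scan_logs().items():
        print(f"{name}: {', '.join(hits)}")
```

An empty report only indicates that none of these strings appear in the logs. A stuck job would not be caught this way, since it hangs rather than printing a crash message.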

@makortel

Is there more information on the stuck jobs somewhere?

@mmusich

mmusich commented Apr 19, 2024

Is there more information on the stuck jobs somewhere?

I don't have more information than what was reported by @fwyzard. We've been trying (unsuccessfully so far) to reproduce it offline using streamer files from Andrea.

@fwyzard
Contributor

fwyzard commented Apr 19, 2024

Hi @makortel,
I investigated the stuck jobs a bit before killing them:

(gdb) info threads
  Id   Target Id                                            Frame
* 1    Thread 0x7f4405260640 (LWP 3275716) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  2    Thread 0x7f43c80a6700 (LWP 3276856) "cmsRun"         0x00007f44060deab4 in read () from /lib64/libpthread.so.0
  3    Thread 0x7f43c155f700 (LWP 3276885) "cuda-EvtHandlr" 0x00007f4405e2a0e1 in poll () from /lib64/libc.so.6
  4    Thread 0x7f43affff700 (LWP 3276886) "cuda-EvtHandlr" 0x00007f43e7245bb5 in ?? () from /lib64/libcuda.so.1
  5    Thread 0x7f4387294700 (LWP 3277105) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  6    Thread 0x7f4386893700 (LWP 3277106) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  7    Thread 0x7f43851ff700 (LWP 3277113) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  8    Thread 0x7f43843ff700 (LWP 3277116) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  9    Thread 0x7f43833ff700 (LWP 3277123) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  10   Thread 0x7f43825fe700 (LWP 3277125) "cmsRun"         0x00007f44060da50f in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
  11   Thread 0x7f43817ff700 (LWP 3277128) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  12   Thread 0x7f4380dfe700 (LWP 3277129) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  13   Thread 0x7f437ffff700 (LWP 3277131) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  14   Thread 0x7f437f5fe700 (LWP 3277133) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  15   Thread 0x7f437e9fd700 (LWP 3277136) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  16   Thread 0x7f437dffc700 (LWP 3277138) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  17   Thread 0x7f43251fc700 (LWP 3277140) "cmsRun"         0x00007f44060db45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  18   Thread 0x7f43245fb700 (LWP 3277141) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  19   Thread 0x7f43237ff700 (LWP 3277149) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  20   Thread 0x7f43229ff700 (LWP 3277151) "cmsRun"         0x00007f44060db45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  21   Thread 0x7f43215fd700 (LWP 3277155) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  22   Thread 0x7f43209fc700 (LWP 3277159) "cmsRun"         0x00007f44060db45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  23   Thread 0x7f43221fe700 (LWP 3277160) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  24   Thread 0x7f43201fb700 (LWP 3277161) "cmsRun"         0x00007f44060da455 in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
  25   Thread 0x7f431f3ff700 (LWP 3277171) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  26   Thread 0x7f431e9fe700 (LWP 3277174) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  27   Thread 0x7f431dbff700 (LWP 3277180) "cmsRun"         0x00007f44060da455 in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
  28   Thread 0x7f431cffe700 (LWP 3277182) "cmsRun"         0x00007f44060da455 in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
  29   Thread 0x7f431bfff700 (LWP 3277183) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  30   Thread 0x7f431b4fe700 (LWP 3277184) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  31   Thread 0x7f431a9fd700 (LWP 3277187) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  32   Thread 0x7f43085d8700 (LWP 3277214) "cmsRun"         0x00007f44060db45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  33   Thread 0x7f4307dd7700 (LWP 3277215) "cmsRun"         0x00007f44060db45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  34   Thread 0x7f4306bff700 (LWP 3277223) "cmsRun"         0x00007f44060db45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  35   Thread 0x7f42e3453700 (LWP 3278200) "cmsRun"         0x00007f4405dff9b8 in nanosleep () from /lib64/libc.so.6
  36   Thread 0x7f42b0609700 (LWP 3278210) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  37   Thread 0x7f42ae806700 (LWP 3278211) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  38   Thread 0x7f42af207700 (LWP 3278212) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  39   Thread 0x7f42afc08700 (LWP 3278213) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  40   Thread 0x7f42adbff700 (LWP 3278215) "cmsRun"         0x00007f44060da455 in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
  41   Thread 0x7f42acdff700 (LWP 3278216) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  42   Thread 0x7f42aafff700 (LWP 3278219) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  43   Thread 0x7f42a69ff700 (LWP 3278250) "cmsRun"         0x00007f44060db7aa in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  44   Thread 0x7f425107f700 (LWP 3278320) "cmsRun"         0x00007f44060ddda6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  45   Thread 0x7f40df182700 (LWP 3343401) "cmsRun"         0x00007f4405e35027 in epoll_wait () from /lib64/libc.so.6

(I also have the full stack trace for all threads here)

From the look of it:

  • a large number of threads were stuck waiting for a mutex inside a CUDA call, for example:
    Thread 31 (Thread 0x7f431a9fd700 (LWP 3277187) "cmsRun"):
    #0  0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
    #1  0x00007f43e75cde86 in ?? () from /lib64/libcuda.so.1
    #2  0x00007f43e72eb660 in ?? () from /lib64/libcuda.so.1
    #3  0x00007f43e73bc309 in ?? () from /lib64/libcuda.so.1
    #4  0x00007f43e8f2a224 in ?? () from /opt/offline/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/external/el8_amd64_gcc12/lib/libcudart.so.12
    #5  0x00007f43e8f66650 in cudaEventSynchronize () from /opt/offline/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/external/el8_amd64_gcc12/lib/libcudart.so.12
    
  • other threads were stuck waiting for the CachingAllocator's lock, for example:
    Thread 25 (Thread 0x7f431f3ff700 (LWP 3277171) "cmsRun"):
    #0  0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
    #1  0x00007f44060d7ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0
    #2  0x00007f43a07226f6 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> >::free(void*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_14_0_5_patch1/lib/el8_amd64_gcc12/pluginRecoLocalCaloEcalRecProducersPluginsPortableCudaAsync.so
    

From a first look I didn't spot anything suspicious - it really looked like the CUDA runtime was somehow hanging, or maybe our code was waiting for an event that would never complete.
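
The grouping described above can be reproduced mechanically by tallying the innermost frame of each thread from the `info threads` listing. The sketch below assumes the row layout shown in the gdb output above; the regular expression and the function name tally_frames are mine, and the sample contains only four rows copied from that listing.

```python
import re

# One row of gdb's `info threads` output, in the layout shown above:
#   [*] <id> Thread 0x... (LWP <n>) "<name>" 0x... in <function> () from <library>
ROW = re.compile(
    r'^\s*\*?\s*(?P<id>\d+)\s+Thread\s+0x[0-9a-f]+\s+\(LWP\s+\d+\)\s+'
    r'"(?P<name>[^"]*)"\s+0x[0-9a-f]+\s+in\s+(?P<func>\S+)\s+\(\)'
)


def tally_frames(info_threads_text):
    """Group gdb thread ids by the function in their innermost frame."""
    by_func = {}
    for line in info_threads_text.splitlines():
        match = ROW.match(line)
        if match:
            by_func.setdefault(match.group("func"), []).append(int(match.group("id")))
    return by_func


# Four rows copied from the listing above, just to exercise the parser.
SAMPLE = '''
* 1    Thread 0x7f4405260640 (LWP 3275716) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  5    Thread 0x7f4387294700 (LWP 3277105) "cmsRun"         0x00007f44060da022 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
  6    Thread 0x7f4386893700 (LWP 3277106) "cmsRun"         0x00007f44060de82d in __lll_lock_wait () from /lib64/libpthread.so.0
  10   Thread 0x7f43825fe700 (LWP 3277125) "cmsRun"         0x00007f44060da50f in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
'''

if __name__ == "__main__":
    for func, ids in sorted(tally_frames(SAMPLE).items()):
        print(f"{len(ids):3d}  {func}  threads {ids}")
```

The innermost frame only shows which primitive a thread is blocked in; telling the CUDA lock apart from the CachingAllocator mutex still requires the full backtrace.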

@mmusich re-ran over the files that one job was processing when it got stuck, and he reproduced the same crash as in the other cases.

So maybe the stuck jobs are a different symptom of the same underlying problem 🤷🏻‍♂️

@mmusich
Contributor

mmusich commented Apr 19, 2024

> had seen no crash nor stuck jobs (as they were originally reported from the online operations team) so far, so the fix seems to be effective.

While reviewing the whole list of error streamer files in more detail, I found another crash that is not cured by CMSSW_14_0_5_patch2; more details at #44786.

@makortel
Contributor

Thanks @fwyzard for the details. Looking at the full stack trace:

  • Thread 7 is calling cudaEventQuery() via make_host_buffer, and is probably holding the lock of CachingAllocator<DevCpu>, which threads 42, 41, 36, 25, 21, 16, 14, 13, 12, 6 are waiting on (all the threads waiting on a caching allocator were waiting on the host allocator specifically)
  • All remaining threads were making some CUDA API call and trying to acquire either read or write access to a read-write lock

I was surprised to see 4 threads (31, 19, 15, 5) calling cudaEventSynchronize() via an EDProducer::produce() call, and will take a closer look at the related code. But given that those threads were trying to acquire (read access on) the lock, "waiting for an event that will never happen" doesn't seem like an obvious cause (though it can't be fully excluded either).

> it really looked like the CUDA runtime was somehow hanging

I agree. Maybe thread 4 (or/and 3?) is (are?) holding the lock(s?) that cmsRun threads are also waiting to lock.
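
To make the suspected pattern concrete, here is a toy model of it in plain Python threading. It is not CMSSW or CUDA code: driver_lock is a hypothetical stand-in for whatever lock the CUDA library holds internally, allocator_lock stands in for the host CachingAllocator mutex, and the assumption that thread 7 holds the allocator lock while blocked inside a CUDA call is only the "probably" from the analysis above.

```python
import threading


def demo(n_waiters=3, timeout=0.2):
    """Return, for each waiter, whether it obtained the allocator lock in time."""
    driver_lock = threading.Lock()     # hypothetical stand-in for the CUDA-internal lock
    allocator_lock = threading.Lock()  # hypothetical stand-in for the host allocator mutex
    holder_ready = threading.Event()
    results = [None] * n_waiters

    def holder():
        # Plays the role of thread 7: takes the allocator lock, then blocks in the "driver".
        with allocator_lock:
            holder_ready.set()
            with driver_lock:
                pass

    def waiter(index):
        # Plays the role of the threads that only need the allocator lock.
        acquired = allocator_lock.acquire(timeout=timeout)
        results[index] = acquired
        if acquired:
            allocator_lock.release()

    driver_lock.acquire()  # the "driver" side holds its lock and does not let go
    holder_thread = threading.Thread(target=holder)
    holder_thread.start()
    holder_ready.wait()    # make sure the holder owns the allocator lock first

    waiters = [threading.Thread(target=waiter, args=(i,)) for i in range(n_waiters)]
    for thread in waiters:
        thread.start()
    for thread in waiters:
        thread.join()

    driver_lock.release()  # unblock the holder so the demo terminates
    holder_thread.join()
    return results


if __name__ == "__main__":
    print(demo())
```

Every waiter times out even though none of them touches the "driver" directly, which is how a single blocked CUDA call could account for the whole group of threads sitting in pthread_mutex_lock. The timeouts exist only so that the toy terminates; the real job had none.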

@makortel
Contributor

> I was surprised to see 4 threads (31, 19, 15, 5) calling cudaEventSynchronize() via EDProducer::produce() call, and will take a closer look on the related code.

This behavior was caused by AlpakaBackendProducer not producing any device-side data products. #44841 avoids the alpaka::wait() call (that led to cudaEventSynchronize()) in that case. The PR also avoids some further alpaka::wait() calls that I realized were unnecessary.

@mmusich
Contributor

mmusich commented May 7, 2024

+hlt

@makortel
Contributor

makortel commented May 7, 2024

+heterogeneous
