HLT Crash -- Run 362720 -- Possibly related to tracking? #40174

Closed
trtomei opened this issue Nov 29, 2022 · 27 comments

@trtomei
Contributor

trtomei commented Nov 29, 2022

Dearest,

HLT crashes were observed in Run 362720. The messages seemed related to Tracking:

#5 0x00007efca351ca97 in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /opt/offline/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/libRecoTrackerTkHitPairs.so
#6 0x00007efca351be9e in RecHitsSortedInPhi::RecHitsSortedInPhi(std::vector<BaseTrackerRecHit const*, std::allocator<BaseTrackerRecHit const*> > const&, Point3DBase<float, GlobalTag> const&, DetLayer const*) () from /opt/offline/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/libRecoTrackerTkHitPairs.so
#7 0x00007efca351846c in LayerHitMapCache::operator()(SeedingLayerSetsHits::SeedingLayer const&, TrackingRegion const&) () from /opt/offline/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/libRecoTrackerTkHitPairs.so
#8 0x00007efca35163ea in HitPairGeneratorFromLayerPair::doublets(TrackingRegion const&, edm::Event const&, edm::EventSetup const&, SeedingLayerSetsHits::SeedingLayer const&, SeedingLayerSetsHits::SeedingLayer const&, LayerHitMapCache&) () from /opt/offline/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/libRecoTrackerTkHitPairs.so
#9 0x00007efc155ab90b in (anonymous namespace)::Impl<(anonymous namespace)::DoNothing, (anonymous namespace)::ImplIntermediateHitDoublets, (anonymous namespace)::RegionsLayersSeparate>::produce(bool, edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/pluginRecoTrackerTkHitPairsPlugins.so

but I couldn't reproduce them.

Instructions to run on Hilton (hilton_c2b02_44_01) as hltpro:

ssh hilton-c2b02-44-01
sudo -u hltpro -i
cd ${HOSTNAME//-/_}/hltpro
source setup.sh -r CMSSW_12_4_11_patch2
cd error
cmsRun hlt_run362720.py

Instructions to run on a regular lxplus machine:

cmsrel CMSSW_12_4_11_patch2
cd CMSSW_12_4_11_patch2/src
cmsenv
cp /afs/cern.ch/user/t/tomei/public/crashRun362720/hlt_run362720.py .
cmsRun hlt_run362720.py

@cms-sw/hlt-l2 FYI
@cms-sw/tracking-dpg-l2 FYI

@cmsbuild
Contributor

A new Issue was created by @trtomei Thiago Tomei.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@dan131riley

@trtomei is there a full log available?

@makortel
Contributor

assign reconstruction

FYI @cms-sw/hlt-l2 @cms-sw/tracking-pog-l2

@cmsbuild
Contributor

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@missirol
Contributor

missirol commented Nov 29, 2022

Let me add some info, since I don't think this is the first occurrence of this HLT crash this year (I should have opened an issue a while back).

  • Similar crashes occurred in the following runs (and maybe more runs I missed): 356945 (Aug-6, 20 crashes), 357611 (Aug-17, 33 crashes), 357754 (Aug-21, 19 crashes), 360874 (Oct-21, 3 crashes), and now 362720 (Nov-26, 7 crashes). One of these crashes was discussed on Aug-18 in the "AlCa+HLT OPS" channel of the "CMS Online Ops" team on Mattermost.

  • Part of those stack traces can still be found on the F3Mon DAQ webpage (but I think this can only be accessed from the P5 network, or via a tunnel to it). This log file contains the available stack traces from the crashes in run-362720.

  • In all these cases, the crashes came in a burst: separate HLT nodes experienced the same type of crash roughly at the same time (within the same minute, or even less).

  • As far as I know, no one managed to reproduce any of these crashes offline.

  • Somewhat unrelated: a similar error message was discussed in "buffer overflow in RecHitsSortedInPhi on NaN" (#35489) and "Segfaults in workflow 11834.21 step 2" (#36336), but those issues were reproducible, and were fixed by "Change to SiStripCluster DataFormat: use auto for return types" (#36852), which was integrated long before 2022 data-taking started.

FYI: @fwyzard (I was discussing this with Andrea at some point in the past)

@makortel
Contributor

Thanks @missirol. So in all cases in the log file the crashing module is HitPairEDProducer:hltDisplacedhltIter4PFlowPixelLessHitDoubletsForTau.

I guess the crash occurs inside this std::sort()

std::sort(theHits.begin(), theHits.end(), HitLessPhi());

with the HitLessPhi being
struct HitLessPhi {
  bool operator()(const HitWithPhi& a, const HitWithPhi& b) { return a.phi() < b.phi(); }
};

An easy way to get sort() to crash is to have a NaN, but I don't know how likely that would be here.
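
To illustrate the failure mode, here is a minimal standalone sketch (plain C++, not CMSSW code; HitWithPhi is reduced to a bare float) of how a single NaN makes such a comparator violate std::sort's strict-weak-ordering contract, after which libstdc++'s introsort may run out of bounds, which can surface as a segfault like the ones in the stack traces above:

#include <algorithm>
#include <cmath>
#include <vector>

struct HitWithPhi {
  float thePhi;
  float phi() const { return thePhi; }
};

struct HitLessPhi {
  bool operator()(const HitWithPhi& a, const HitWithPhi& b) const { return a.phi() < b.phi(); }
};

int main() {
  // Distinct, well-ordered phi values...
  std::vector<HitWithPhi> hits;
  for (int i = 0; i < 64; ++i)
    hits.push_back(HitWithPhi{0.1f * i});
  // ...plus one NaN: every comparison involving it is false, so two elements
  // that compare unequal to each other are both "equivalent" to the NaN,
  // breaking transitivity of equivalence. Sorting is then undefined behavior;
  // a read or write past the end of the vector is one possible (not
  // guaranteed) outcome.
  hits[10].thePhi = std::nanf("");
  std::sort(hits.begin(), hits.end(), HitLessPhi());
}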

@missirol
Contributor

assign hlt

(To make sure this remains on HLT's radar.)

@cmsbuild
Contributor

New categories assigned: hlt

@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

@missirol
Contributor

Following up on this issue, as we still see HLT crashes similar to this one.

These appeared again in run-367112, run-367337 and run-367553 (all pp-collision runs). The CMSSW release at HLT then was CMSSW_13_0_5_patch1.

The corresponding stack traces from DAQ are attached below, and the corresponding error-stream files are available on EOS.
f3mon_run367112.txt
f3mon_run367337.txt
f3mon_run367553.txt

First attempts to reproduce the crashes offline failed (again). The recipe used in those failed attempts was adapted in [1] to be valid for lxplus and lxplus-gpu.

FYI: @silviodonato @fwyzard @mzarucki

[1]

#!/bin/bash

# cmsrel CMSSW_13_0_5_patch1
# cd CMSSW_13_0_5_patch1
# cmsenv
# # save this file as test.sh
# chmod u+x test.sh
# ./test.sh 367553 4 # runNumber nThreads

[ $# -eq 2 ] || exit 1

RUNNUM="${1}"
NUMTHREADS="${2}"

ERRDIR=/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream
RUNDIR="${ERRDIR}"/run"${RUNNUM}"

for dirPath in $(ls -d "${RUNDIR}"*); do
  # require at least one non-empty FRD file
  [ $(cd "${dirPath}" ; find -maxdepth 1 -size +0 | grep .raw | wc -l) -gt 0 ] || continue
  runNumber="${dirPath: -6}"
  JOBTAG=test_run"${runNumber}"
  HLTMENU="--runNumber ${runNumber}"
  hltConfigFromDB ${HLTMENU} > "${JOBTAG}".py
  cat <<EOF >> "${JOBTAG}".py
process.options.numberOfThreads = ${NUMTHREADS}
process.options.numberOfStreams = 0
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)
del process.PrescaleService
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
import os
import glob
process.source.fileListMode = True
process.source.fileNames = sorted([foo for foo in glob.glob("${dirPath}/*raw") if os.path.getsize(foo) > 0])
process.EvFDaqDirector.buBaseDir = "${ERRDIR}"
process.EvFDaqDirector.runNumber = ${runNumber}
process.hltDQMFileSaverPB.runNumber = ${runNumber}
# remove paths containing OutputModules
streamPaths = [pathName for pathName in process.finalpaths_()]
for foo in streamPaths:
    process.__delattr__(foo)
EOF
  rm -rf run"${runNumber}"
  mkdir run"${runNumber}"
  echo "run${runNumber} .."
  cmsRun "${JOBTAG}".py &> "${JOBTAG}".log
  echo "run${runNumber} .. done (exit code: $?)"
  unset runNumber
done
unset dirPath

@missirol
Contributor

missirol commented Jun 5, 2023

Reporting another instance of these HLT crashes in run-368489 (3 crashes).

What is interesting about this kind of crash is that it often comes in bursts (multiple crashes within a few seconds on different HLT nodes). The full stack trace of the latest crashes is attached (includes metadata).

f3mon_run368489.txt

@missirol
Contributor

Reporting another HLT crash of this kind. This time, there was only one such crash in the run.

  • Run 368566 (pp collisions)
  • Release: CMSSW_13_0_7
  • Full log from DAQ: f3mon_run368566.txt (3rd crash in the log)
  • Piece of stack trace:
Thread 8 (Thread 0x7f3a367ff700 (LWP 1468447) "cmsRun"):
#0  0x00007f3aada56a71 in poll () from /lib64/libc.so.6
#1  0x00007f3aa6af046f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2  0x00007f3aa6abbb6c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  0x00007f3aa6abc33b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f3a4d892777 in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#6  0x00007f3a4d894bf5 in RecHitsSortedInPhi::RecHitsSortedInPhi(std::vector<BaseTrackerRecHit const*, std::allocator<BaseTrackerRecHit const*> > const&, Point3DBase<float, GlobalTag> const&, DetLayer const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#7  0x00007f3a4d898d11 in LayerHitMapCache::operator()(SeedingLayerSetsHits::SeedingLayer const&, TrackingRegion const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#8  0x00007f3a4d89907a in HitPairGeneratorFromLayerPair::doublets(TrackingRegion const&, edm::Event const&, edm::EventSetup const&, SeedingLayerSetsHits::SeedingLayer const&, SeedingLayerSetsHits::SeedingLayer const&, LayerHitMapCache&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#9  0x00007f39a2390d97 in (anonymous namespace)::Impl<(anonymous namespace)::DoNothing, (anonymous namespace)::ImplIntermediateHitDoublets, (anonymous namespace)::RegionsLayersSeparate>::produce(bool, edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/pluginRecoTrackerTkHitPairsPlugins.so
#10 0x00007f3ab04a395d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_7/lib/el8_amd64_gcc11/libFWCoreFramework.so

(..)

Current Modules:
Module: HitPairEDProducer:hltDisplacedhltIter4PFlowPixelLessHitDoubletsForTau (crashed)
Module: SeedCreatorFromRegionConsecutiveHitsEDProducer:hltElePixelSeedsDoubletsUnseeded
Module: PFBlockProducer:hltPixelOnlyParticleFlowBlock
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates
Module: PixelTrackProducerFromSoAPhase1:hltPixelTracks
Module: none
Module: PFBlockProducer:hltParticleFlowBlock
Module: ElectronNHitSeedProducer:hltEgammaElectronPixelSeeds
Module: PixelTrackProducerFromSoAPhase1:hltPixelTracks
Module: none
Module: SiPixelRawToClusterCUDA:hltSiPixelClustersGPU
Module: PFClusterProducer:hltParticleFlowClusterHBHE
Module: HLTEcalRecHitInAllL1RegionsProducer:hltRechitInRegionsES
Module: ElectronNHitSeedProducer:hltEgammaElectronPixelSeeds
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates
Module: HLTCaloJetIDProducer:hltAK4CaloJetsIDPassed
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates
Module: none
Module: CSCSegmentProducer:hltCscSegments
Module: L3MuonProducer:hltL3MuonsIterL3OI
Module: PixelTrackProducerFromSoAPhase1:hltPixelTracks
Module: EcalUncalibRecHitProducerGPU:hltEcalUncalibRecHitGPU
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates
Module: SeedGeneratorFromProtoTracksEDProducer:hltIter0PFLowPixelSeedsFromPixelTracks
Module: none
Module: ElectronNHitSeedProducer:hltEgammaElectronPixelSeedsUnseeded
Module: none
Module: L2MuonProducer:hltL2Muons
Module: PFRecHitProducer:hltParticleFlowRecHitHBHE
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidatesNoVtx
Module: CkfTrackCandidateMaker:hltIterL3OITrackCandidates
Module: CkfTrackCandidateMaker:hltIter0PFlowCkfTrackCandidates
A fatal system signal has occurred: segmentation violation

@missirol
Contributor

missirol commented Jun 11, 2023

Reporting another HLT crash of this kind. This time, there were three such crashes within a few seconds of each other, on different HLT nodes, in the same run.

  • Run 368685 (pp collisions)
  • Release: CMSSW_13_0_7
  • Full log from DAQ: f3mon_run368685.txt (see first 3 crashes in the log)

@missirol
Contributor

missirol commented Jun 20, 2023

There is a reproducer [*] which might be related to this issue. The reproducer was tested with CMSSW_13_0_7.

Some things are still unclear to me, though. The reproducer comes from run-368685, which had 4 crashes (see previous comment). Three of the four crashes mentioned RecHitsSortedInPhi in the thread that issued sig_dostack_then_abort. The reproducer, on the other hand, uses the file corresponding to the 4th crash (the one where RecHitsSortedInPhi does not follow sig_dostack_then_abort, although it is mentioned in one of the other threads); locally, this crashes reproducibly, with a stack trace that sometimes mentions RecHitsSortedInPhi after sig_dostack_then_abort. As far as I checked, the reproducer crashes only when using the GPU reconstruction (as in production at HLT), e.g. on lxplus8-gpu.

  • Log from DAQ for run-368685: f3mon_run368685.txt
  • Log from reproducer [*] on lxplus8-gpu: hltTest368685_1.log (RecHitsSortedInPhi follows sig_dostack_then_abort)
  • Log from a similar reproducer, still on lxplus8-gpu: hltTest368685_2.log (RecHitsSortedInPhi does not follow sig_dostack_then_abort, but again it is mentioned in one of the other threads)

[*]

#!/bin/bash

# cmsrel CMSSW_13_0_7
# cd CMSSW_13_0_7/src
# cmsenv

OUTFILE=hltTest368685
ACCELER=gpu-nvidia

hltGetConfiguration run:368685 \
 --data \
 --no-prescale \
 --no-output \
 --globaltag 130X_dataRun3_HLT_v2 \
 --max-events 1 \
 --paths HLT_DoubleMediumChargedIsoDisplacedPFTauHPS32_Trk1_eta2p1_v* \
 --input file:/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/root/run368685/run368685_ls1009_index000027_fu-c2b02-41-01_pid1894024.root \
 > "${OUTFILE}".py

cat <<@EOF >> "${OUTFILE}".py
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')

process.options.numberOfThreads = 1
process.options.accelerators = ['${ACCELER}']

process.source.skipEvents = cms.untracked.uint32(64)
process.maxEvents.input = 1
@EOF

cmsRun "${OUTFILE}".py &> "${OUTFILE}".log

@missirol
Contributor

(I edited the reproducer in #40174 (comment) to simplify it slightly.)

Based on the reproducer, the first NaN I can find appears in

LocalPoint projectedPos = lPos - ldir * delta / ldir.z();

because there is one case where ldir.z() == 0 [1] [2]. The crash occurs only with GPU offloading (no crash with CPU only) [3].
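
For illustration, the same arithmetic in a standalone snippet (plain C++, not the CMSSW classes; the input values are copied from the GPU printout in [2]): with ldir.z() == 0, the x and y components of projectedPos subtract finite/0 = inf, while the z component subtracts (0 * delta)/0 = NaN, reproducing the "(-inf,-inf,-nan)" shown below.

#include <cstdio>

int main() {
  const float delta = 0.291381f;
  const float ldir[3] = {30.8843f, 32.697f, 0.f};           // local direction; ldir.z() == 0 in the GPU case
  const float lPos[3] = {4.17944e+08f, 4.42474e+08f, 8.f};  // local position from the printout
  float projectedPos[3];
  for (int i = 0; i < 3; ++i)
    projectedPos[i] = lPos[i] - ldir[i] * delta / ldir[2];  // x,y: finite - inf = -inf; z: 8 - NaN = NaN
  std::printf("projectedPos = (%g,%g,%g)\n", projectedPos[0], projectedPos[1], projectedPos[2]);
}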

The module that's crashing is hltDisplacedhltIter4PFlowPixelLessHitDoubletsForTau, which is used only in the Path HLT_DoubleMediumChargedIsoDisplacedPFTauHPS32_Trk1_eta2p1_v* (FYI: @sarafiorendi). The content of the Path can be checked here.

Could experts please have a look?

@cms-sw/reconstruction-l2 @cms-sw/tracking-pog-l2

[1]

diff --git a/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h b/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
index f740faff365..529d4ac1d10 100644
--- a/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
+++ b/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
@@ -2,6 +2,7 @@
 #define RecHitPropagator_H
 
 #include "TrackingTools/TrajectoryState/interface/TrajectoryStateOnSurface.h"
+#include "FWCore/MessageLogger/interface/MessageLogger.h"
 
 class TrackingRecHit;
 class MagneticField;
@@ -18,12 +19,17 @@ public:
 
 // propagate from glued to mono/stereo
 inline TrajectoryStateOnSurface fastProp(const TrajectoryStateOnSurface& ts, const Plane& oPlane, const Plane& tPlane) {
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " ts.isValid=" << ts.isValid() << " ts.globalPosition=" << ts.globalPosition();
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " oPlane.position=" << oPlane.position() << " tPlane.position=" << tPlane.position();
   GlobalVector gdir = ts.globalMomentum();
-
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " gdir=" << gdir;
   double delta = tPlane.localZ(oPlane.position());
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " delta=" << delta;
   LocalVector ldir = tPlane.toLocal(gdir);  // fast prop!
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " ldir=" << ldir;
   LocalPoint lPos = tPlane.toLocal(ts.globalPosition());
   LocalPoint projectedPos = lPos - ldir * delta / ldir.z();
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " lPos=" << lPos << " projectedPos=" << projectedPos;
   // we can also patch it up as only the position-errors are used...
   GlobalTrajectoryParameters gp(
       tPlane.toGlobal(projectedPos), gdir, ts.charge(), &ts.globalParameters().magneticField());

[2] Output on GPU:

 RecHitPropagator::fastProp 22 ts.isValid=1 ts.globalPosition= (4.50529e+08,2.60508e+08,3.15626e+08) 
 RecHitPropagator::fastProp 23 oPlane.position= (21.7863,27.3469,14.7851)  tPlane.position= (21.7116,27.6217,14.7126) 
 RecHitPropagator::fastProp 25 gdir= (33.2922,19.2505,23.3234) 
 RecHitPropagator::fastProp 27 delta=0.291381
 RecHitPropagator::fastProp 29 ldir= (30.8843,32.697,0) 
 RecHitPropagator::fastProp 32 lPos= (4.17944e+08,4.42474e+08,8)  projectedPos= (-inf,-inf,-nan) 


A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

(..)

Current Modules:

Module: HitPairEDProducer:hltDisplacedhltIter4PFlowPixelLessHitDoubletsForTau (crashed)
Module: none

A fatal system signal has occurred: segmentation violation

[3] Output on CPU:

RecHitPropagator::fastProp 22 ts.isValid=1 ts.globalPosition= (2.25265e+08,1.30254e+08,1.57813e+08) 
RecHitPropagator::fastProp 23 oPlane.position= (21.7863,27.3469,14.7851)  tPlane.position= (21.7116,27.6217,14.7126) 
RecHitPropagator::fastProp 25 gdir= (33.2922,19.2504,23.3234) 
RecHitPropagator::fastProp 27 delta=0.291381
RecHitPropagator::fastProp 29 ldir= (30.8843,32.697,-9.53674e-07) 
RecHitPropagator::fastProp 32 lPos= (2.08972e+08,2.21237e+08,-4)  projectedPos= (2.18408e+08,2.31227e+08,-4.29138) 

@slava77
Contributor

slava77 commented Jun 23, 2023

ts.globalPosition= (2.25265e+08,1.30254e+08,1.57813e+08)

It may be more practical to start further upstream, requiring the input position to be at least within the CMS cavern, or perhaps a bit more local.

@missirol
Contributor

missirol commented Jul 5, 2023

ts.globalPosition= (2.25265e+08,1.30254e+08,1.57813e+08)

It may be more practical to start further upstream, requiring the input position to be at least within the CMS cavern, or perhaps a bit more local.

Digging a bit deeper (for reference, missirol@bd3a869 contains some untidy printouts used on top of #40174 (comment)), I see that (2.25265e+08,1.30254e+08,1.57813e+08) originates from

x = LocalPoint(x.x() + (p.x() / p.z()) * s, x.y() + (p.y() / p.z()) * s, x.z() + s);

because p.z() == -9.53674e-07. The use of StraightLinePropagator in this case comes from
StraightLinePropagator prop(theField, alongMomentum);
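
As a rough, standalone numeric check (not the CMSSW propagator; p is taken to be of the same order as the ldir printed earlier, and the path parameter s is an assumed value of a few cm, for scale only), dividing by p.z() ~ -1e-6 turns O(10) direction components into O(1e7) slopes, which is how the straight-line step above lands the propagated position at O(1e8) cm:

#include <cstdio>

int main() {
  const double px = 30.8843, py = 32.697, pz = -9.53674e-07;  // nearly parallel to the target plane
  const double s = -7.0;                                      // assumed path parameter (cm), for scale only
  std::printf("dx = %.3g, dy = %.3g\n", (px / pz) * s, (py / pz) * s);  // both O(1e8)
}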

After discussing offline with @mmusich, the simplest fix I could come up with is [*] (this avoids the crash in the reproducer), but obviously this should be reviewed by experts.

It might also be useful to review this particular trigger Path, and understand why it is prone to this issue.

[*]

diff --git a/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h b/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
index f740faff365..ad4411b1374 100644
--- a/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
+++ b/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
@@ -19,9 +19,14 @@ public:
 // propagate from glued to mono/stereo
 inline TrajectoryStateOnSurface fastProp(const TrajectoryStateOnSurface& ts, const Plane& oPlane, const Plane& tPlane) {
   GlobalVector gdir = ts.globalMomentum();
+  LocalVector ldir = tPlane.toLocal(gdir);  // fast prop!
+
+  // if ldir.z() == 0, return an invalid TrajectoryStateOnSurface
+  if (ldir.z() == 0) {
+    return TrajectoryStateOnSurface();
+  }
 
   double delta = tPlane.localZ(oPlane.position());
-  LocalVector ldir = tPlane.toLocal(gdir);  // fast prop!
   LocalPoint lPos = tPlane.toLocal(ts.globalPosition());
   LocalPoint projectedPos = lPos - ldir * delta / ldir.z();
   // we can also patch it up as only the position-errors are used...
diff --git a/RecoTracker/MeasurementDet/plugins/doubleMatch.icc b/RecoTracker/MeasurementDet/plugins/doubleMatch.icc
index ab45ef8d6c6..ff1a5efb34e 100644
--- a/RecoTracker/MeasurementDet/plugins/doubleMatch.icc
+++ b/RecoTracker/MeasurementDet/plugins/doubleMatch.icc
@@ -82,7 +82,8 @@ void TkGluedMeasurementDet::doubleMatch(const TrajectoryStateOnSurface& ts,
     if LIKELY (!emptyMono) {
       // mono does require "projection" for precise estimate
       TrajectoryStateOnSurface mts = fastProp(ts, geomDet().surface(), theMonoDet->geomDet().surface());
-      theMonoDet->simpleRecHits(mts, collector.estimator(), data, monoHits);
+      if LIKELY (mts.isValid())
+        theMonoDet->simpleRecHits(mts, collector.estimator(), data, monoHits);
     }
     // print("mono", mts,ts);
     mf = monoHits.size();
@@ -96,7 +97,8 @@ void TkGluedMeasurementDet::doubleMatch(const TrajectoryStateOnSurface& ts,
     emptyStereo = theStereoDet->empty(data);
     if LIKELY (!emptyStereo) {
       TrajectoryStateOnSurface pts = fastProp(ts, geomDet().surface(), theStereoDet->geomDet().surface());
-      theStereoDet->simpleRecHits(pts, collector.estimator(), data, stereoHits);
+      if LIKELY (pts.isValid())
+        theStereoDet->simpleRecHits(pts, collector.estimator(), data, stereoHits);
       // print("stereo", pts,ts);
     }
     sf = stereoHits.size();

@slava77
Contributor

slava77 commented Jul 6, 2023

It seems OK to apply your changes to avoid the crash specifically here with ldir.z() == 0.

However, it seems like the problem is more generic, and calls to ::doubleMatch (or perhaps even ::collectRecHits()) with a poorly defined TSOS had better be made to skip/return. The condition could be something like any of x, y, z being above 2e3 or so.
Better yet, a skip could be added in the earlier, non-generic (detector-aware) caller.

Without it, I wouldn't be surprised if some other call taking a ts.globalPosition = (2.25265e+08,1.30254e+08,1.57813e+08) led to another NaN and a crash.

@missirol
Contributor

missirol commented Jul 7, 2023

@slava77, thanks for the suggestions in #40174 (comment); I can try to improve the patch.

@missirol
Contributor

missirol commented Jul 7, 2023

Just reporting one more crash of this kind in run-370093.

The log from DAQ mentions RecHitsSortedInPhi only under sig_pause_for_stacktrace. Offline, the crash is reproducible [1], and the stack trace is similar to the ones discussed in this issue [2].

[1]
https://gist.github.com/missirol/aa4da3b3424e4f06ba5d8b9a4d904040

source rerun_hlt_on_error_stream.sh -r 370093 -o out_run370093 -t 1 \
 -i /eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run370093

[2]

Thread 1 (Thread 0x7f8811403640 (LWP 618260) "cmsRun"):
#0  0x00007f8811fcc0e1 in poll () from /lib64/libc.so.6
#1  0x00007f8808c8546f in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2  0x00007f8808c50b6c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  0x00007f8808c5133b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f87b279d064 in RecHitsSortedInPhi::RecHitsSortedInPhi(std::vector<BaseTrackerRecHit const*, std::allocator<BaseTrackerRecHit const*> > const&, Point3DBase<float, GlobalTag> const&, DetLayer const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#6  0x00007f87b27a0d01 in LayerHitMapCache::operator()(SeedingLayerSetsHits::SeedingLayer const&, TrackingRegion const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#7  0x00007f87b27a108f in HitPairGeneratorFromLayerPair::doublets(TrackingRegion const&, edm::Event const&, edm::EventSetup const&, SeedingLayerSetsHits::SeedingLayer const&, SeedingLayerSetsHits::SeedingLayer const&, LayerHitMapCache&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#8  0x00007f8730aaad97 in (anonymous namespace)::Impl<(anonymous namespace)::DoNothing, (anonymous namespace)::ImplIntermediateHitDoublets, (anonymous namespace)::RegionsLayersSeparate>::produce(bool, edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/pluginRecoTrackerTkHitPairsPlugins.so
#9  0x00007f88149e895d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00007f88149cf072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreFramework.so
#11 0x00007f881495b6da in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreFramework.so
#12 0x00007f881495bb88 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreFramework.so
#13 0x00007f88146b0f79 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreConcurrency.so
#14 0x00007f88131329cd in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7f87242f4100, this=0x7f880fcc3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f880fcc3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007f88148de10d in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreFramework.so
#18 0x00007f88148ebb91 in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreFramework.so
#19 0x00007f88148ec0e6 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_9/lib/el8_amd64_gcc11/libFWCoreFramework.so
#20 0x000000000040c17d in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#21 0x00007f8813120847 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_0_2-el8_amd64_gcc11/build/CMSSW_13_0_2-build/BUILD/el8_amd64_gcc11/external/tbb/v2021.8.0-bb5e0283c68ca6d69bd8419f6c08f7b1/tbb-v2021.8.0/src/tbb/arena.cpp:694
#22 0x00000000004101d0 in main::{lambda()#1}::operator()() const ()
#23 0x0000000000407d6c in main ()

@missirol
Contributor

However, it seems like the problem is more generic, and calls to ::doubleMatch (or perhaps even ::collectRecHits()) with a poorly defined TSOS had better be made to skip/return. The condition could be something like any of x, y, z being above 2e3 or so. Better yet, a skip could be added in the earlier, non-generic (detector-aware) caller.

Thanks to clarifications received offline from @slava77, I tried to address his comment in missirol@6a18b81 (this fixes the two reproducers found thus far). Some details in [*].

To summarise, missirol@269dbd9 and missirol@6a18b81 are what I could come up with as a layman in tracking. My goal is to fix the crashes seen at HLT, so it would be important to be able to backport the fix (whatever it ends up being).

How to proceed?


[*] The rationale of missirol@6a18b81 is that LayerMeasurements seemed like the earliest "generic" point where one could catch the "bad" TSOS (at least for the crash in question). This "bad" position could be caught earlier, but this would likely require touching a long list of detector-specific classes (meaning, it might be better to not have "bad" TSOS returned by methods like DetLayer::compatibleDets, but this means touching the many derived classes of DetLayer, as far as I understand). The implementation of layerIsCompatibleWithTSOS in missirol@6a18b81 is completely arbitrary, just meant as an example (to be improved).
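
For concreteness, a minimal sketch of the shape such a check could take (hypothetical: the signature, the position-only condition, and the 2e3 cm bound taken from the suggestion quoted above are all illustrative, and the actual layerIsCompatibleWithTSOS in missirol@6a18b81 may well differ):

#include <cmath>

#include "TrackingTools/TrajectoryState/interface/TrajectoryStateOnSurface.h"

class DetLayer;

// Illustrative only: ignore the layer and reject any TSOS whose global
// position is non-finite or far outside the detector volume.
inline bool layerIsCompatibleWithTSOS(const DetLayer& /* layer */, const TrajectoryStateOnSurface& tsos) {
  if (!tsos.isValid())
    return false;
  constexpr float maxCoord = 2e3f;  // cm; nothing this far out can be a real tracker measurement
  const auto& pos = tsos.globalPosition();
  return std::isfinite(pos.x()) && std::isfinite(pos.y()) && std::isfinite(pos.z()) &&
         std::abs(pos.x()) < maxCoord && std::abs(pos.y()) < maxCoord && std::abs(pos.z()) < maxCoord;
}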

@slava77
Contributor

slava77 commented Jul 18, 2023

To summarise, missirol@269dbd9 and missirol@6a18b81 are what I could come up with as a layman in tracking. My goal is to fix the crashes seen at HLT, so it would be important to be able to backport the fix (whatever it ends up being).

How to proceed?

Both look reasonable to me. Perhaps the next step is to make a PR to the master branch and see if any differences show up in the PR tests.

@missirol
Contributor

Just for the record, I tested the reproducer in #40174 (comment) using a more recent HLT menu (i.e. /dev/CMSSW_13_0_0/GRun/V147). The latter includes CMSHLT-2820, where the trigger in question (displaced-PFTau trigger) was updated by tightening some cuts. Unfortunately, the updated trigger is not enough to avoid the crash.

@slava77
Contributor

slava77 commented Jul 29, 2023

#42318 and #42320 were merged.
Is this issue still active, or was it resolved with the above?

@slava77
Contributor

slava77 commented Jul 29, 2023

type tracking

@missirol
Contributor

+hlt

@missirol
Contributor

Is this issue still active, or was it resolved with the above?

I summarised my understanding in #40174 (comment). I think we can close this issue. If similar crashes reappear, we can re-open it, or create a new one.

I'll let RECO conveners comment/sign.

@missirol
Contributor

missirol commented Aug 7, 2023

please close

I think we can close this issue. If similar crashes reappear, we can re-open it, or create a new one.

Going ahead with closing this issue.

@cmsbuild cmsbuild closed this as completed Aug 7, 2023