igprof pp segfault in 12_3_0, 12_3_0_pre4 in the Run3 reco step #37816

jpata · 2022-05-05T11:58:11Z

In two recent releases, igprof pp crashes in the reco step in 11834.21:

$ tail -n10 /eos/cms/store/user/cmsbuild/profiling/data/CMSSW_12_3_0/slc7_amd64_gcc10/11834.21/step3_igprof_cpu.txt
#14 0x00007f36c6833f45 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7ffc58699700, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-f8a2b39c36d2a318d6c7c0f619242bdb/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#15 0x00007f36c683d508 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-f8a2b39c36d2a318d6c7c0f619242bdb/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007f368f898ee2 in mkfit::kalmanOperation(int, Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::MatriplexSym<float, 3, 4> const&, Matriplex::Matriplex<float, 3, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 1, 1, 4>&, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so

Current Modules:

Module: MkFitProducer:detachedTripletStepTrackCandidatesMkFit (crashed)

A fatal system signal has occurred: segmentation violation

$ tail -n10 /eos/cms/store/user/cmsbuild/profiling/data/CMSSW_12_4_0_pre3/slc7_amd64_gcc10/11834.21/step3_igprof_cpu.txt
#14 0x00007f9899389f45 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7ffd8674d480, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-95dc8f7dd3ee3d76c20fd25518fc6fa9/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#15 0x00007f9899393508 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-95dc8f7dd3ee3d76c20fd25518fc6fa9/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007f9860bcf8ea in mkfit::propagateHelixToZMPlex(Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, int, mkfit::PropagationFlags, Matriplex::Matriplex<int, 1, 1, 4> const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre3/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so

Current Modules:

Module: MkFitProducer:highPtTripletStepTrackCandidatesMkFit (crashed)

A fatal system signal has occurred: segmentation violation

In both cases, the current module is MkFitProducer. Is it a coincidence, or do we have a regression?

Note that igprof mp does not crash in these workflows, and the crash happens around event 230-260. Since jenkins tries to run igprof several times in case of failure, it looks like it's reproducible.

@slava77 @gartung

The text was updated successfully, but these errors were encountered:

jpata · 2022-05-05T11:58:22Z

assign reconstruction

cmsbuild · 2022-05-05T11:58:27Z

New categories assigned: reconstruction

@jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2022-05-05T11:58:28Z

A new Issue was created by @jpata Joosep Pata.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

slava77 · 2022-05-05T13:05:44Z

@jpata
does this happen on the same event and in the same module?

jpata · 2022-05-05T13:09:40Z

It's not the same event (but roughly ~230...260 events in) in different releases, nor the same module (highPtTripletStepTrackCandidatesMkFit vs. detachedTripletStepTrackCandidatesMkFit).

slava77 · 2022-05-05T13:30:58Z

It's not the same event (but roughly ~230...260 events in) in different releases, nor the same module (highPtTripletStepTrackCandidatesMkFit vs. detachedTripletStepTrackCandidatesMkFit).

there were no recent updates in the propagate or kalman update routines.
This still seems similar to the previous case where the issue was with the profiler itself having some outdated (was it TBB or pthread?) dependencies.

gartung · 2022-05-05T17:01:04Z

It could be some new TBB thread behavior. I could try running the work flow with gpertools cpu profiling enabled to see if has the same segfault.

Is this related to the segfault in propagateHelixToZMPlex when fast-math optimizations are enabled?

gartung · 2022-05-05T17:04:15Z

This is the fix that Steve Lantz suggested

-
-  void propagateHelixToZMPlex(const MPlexLS& inErr,
+  
+  void __attribute__ ((optimize("no-inline"))) propagateHelixToZMPlex(const MPlexLS& inErr,

gartung · 2022-05-05T17:05:53Z

The crash is caused by the () operator being optimized out.

jpata · 2022-05-06T13:18:18Z

@gartung do we know why the inlining / optimization of the code being profiled causes the profiler to crash?
I suppose ideally the fix should be on the profiler side, rather than changing what I understand is valid code on the CMSSW side.

gartung · 2022-05-06T14:01:07Z

I will run the profiler with debugging statements enabled to see if that tells me anything. The last segfault was "fixed" by updating libunwind.

gartung · 2022-05-06T14:02:16Z

I will also run the gpertools profiler to see if it segfaults.

gartung · 2022-05-06T15:48:39Z

First run through with the same file did not produce a segfault.

jpata · 2022-05-10T11:07:43Z

It segfaults reliably about halfway through the 500 events if you don't use gdb or don't recompile with debugging symbols, I just tried again on vocms011. Is it possible that gdb or debug symbols make the crash go away?

gartung · 2022-05-10T13:33:34Z

is the job run inside a Singularity image?

gartung · 2022-05-10T13:43:01Z

Nevermind. I did get a crash on the second run without gdb that I started Friday evening.

gartung · 2022-05-10T16:04:29Z

The segfault is in libunwind or libdwarf. I am attempting catch the segault in gdb.

gartung · 2022-05-10T16:33:13Z

I am trying this patch
libunwind/libunwind#348

gartung · 2022-05-10T16:39:47Z

Patch was incorporated into head of 1.6-stable branch in Dec2021. I am updating the libunwind spec to use the latest commit on that branch. I will test with the updated libunwind.

gartung · 2022-05-10T16:51:47Z

Captured backtrace in debugger

_ULx86_64_dwarf_eval_expr (c=c@entry=0x7fffffff03c0, stack_val=stack_val@entry=0, addr=addr@entry=0x7ffffffef828, len=<optimized out>, 
    valp=valp@entry=0x7ffffffef830, is_register=is_register@entry=0x7ffffffef820) at dwarf/Gexpr.c:432
432     dwarf/Gexpr.c: No such file or directory.
(gdb) bt
#0  _ULx86_64_dwarf_eval_expr (c=c@entry=0x7fffffff03c0, stack_val=stack_val@entry=0, addr=addr@entry=0x7ffffffef828, len=<optimized out>, 
    valp=valp@entry=0x7ffffffef830, is_register=is_register@entry=0x7ffffffef820) at dwarf/Gexpr.c:432
#1  0x00007ffff517ac48 in eval_location_expr (as=<optimized out>, a=<optimized out>, arg=<optimized out>, locp=<synthetic pointer>, 
    addr=<optimized out>, stack_val=0, c=0x7fffffff03c0) at dwarf/Gparser.c:756
#2  apply_reg_state (c=c@entry=0x7fffffff03c0, rs=rs@entry=0x7ffffffefa98) at dwarf/Gparser.c:817
#3  0x00007ffff517b249 in _ULx86_64_dwarf_step (c=c@entry=0x7fffffff03c0) at dwarf/Gparser.c:974
#4  0x00007ffff5177116 in _ULx86_64_step (cursor=cursor@entry=0x7fffffff03c0) at x86_64/Gstep.c:75
#5  0x00007ffff51753d9 in slow_backtrace (uc=0x7ffffffefc10, size=800, buffer=0x7fffffff0840) at mi/backtrace.c:45
#6  unw_backtrace (buffer=buffer@entry=0x7fffffff0840, size=size@entry=800) at mi/backtrace.c:72
#7  0x00007ffff7fdaf45 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7fffffff0840, nmax=nmax@entry=800)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-95dc8f7dd3ee3d76c20fd25518fc6fa9/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#8  0x00007ffff7fe4508 in profileSignalHandler ()
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-95dc8f7dd3ee3d76c20fd25518fc6fa9/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#9  <signal handler called>
#10 0x00007fffbf3e8eae in mkfit::propagateHelixToRMPlex(Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, int, mkfit::PropagationFlags, Matriplex::Matriplex<int, 1, 1, 4> const*) ()
   from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre3/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so
Backtrace stopped: Cannot access memory at address 0x3ae

gartung · 2022-05-10T16:52:18Z

Looks like this bug might be addressed by updating libunwind.

jpata · 2022-05-17T08:44:16Z

Looks like it still crashed in 12_4_0_pre4: /eos/cms/store/user/cmsbuild/profiling/data/CMSSW_12_4_0_pre4/slc7_amd64_gcc10/11834.21/step3_igprof_cpu.txt

Begin processing the 183rd record. Run 1, Event 183, LumiSection 1 on stream 0 at 16-May-2022 22:13:58.719 CEST
%MSG-e TrajectoryNotPosDef:  TrackProducer:detachedTripletStepTracks  16-May-2022 22:14:02 CEST Run: 1 Event: 183
Trajectory covariance is not positive-definite
%MSG
...
#9  apply_reg_state (c=c@entry=0x7fff9609eb40, rs=rs@entry=0x7fff9609e218) at dwarf/Gparser.c:826
#10 0x00007efe34c6d25c in _ULx86_64_dwarf_step (c=c@entry=0x7fff9609eb40) at dwarf/Gparser.c:981
#11 0x00007efe34c69116 in _ULx86_64_step (cursor=cursor@entry=0x7fff9609eb40) at x86_64/Gstep.c:75
#12 0x00007efe34c673d9 in slow_backtrace (uc=0x7fff9609e390, size=800, buffer=0x7fff9609efc0) at mi/backtrace.c:45
#13 unw_backtrace (buffer=buffer@entry=0x7fff9609efc0, size=size@entry=800) at mi/backtrace.c:72
#14 0x00007efe37ad3f45 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7fff9609efc0, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre4-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-b55509ef966dc5068aea62f260fb3883/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#15 0x00007efe37add508 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre4-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-b55509ef966dc5068aea62f260fb3883/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007efdff9abd1a in mkfit::propagateHelixToZMPlex(Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, int, mkfit::PropagationFlags, Matriplex::Matriplex<int, 1, 1, 4> const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so

Current Modules:

Module: MkFitProducer:pixelLessStepTrackCandidatesMkFit (crashed)

gartung · 2022-05-17T13:56:03Z

@smuzaffar has the update to libunwind/igprof been incorporated into the cmssw toolset?

smuzaffar · 2022-05-17T13:59:41Z

yes @gartung , cms-sw/cmsdist#7853 has been integrated

gartung · 2022-05-17T14:04:45Z

@jpata can you please verify what igprof and libunwind are being used in the profiling job.

gartung · 2022-05-17T14:05:48Z

@smuzaffar did the updates make it into the CMSSW_12_4_0_pre4 toolset?

gartung · 2022-05-18T15:41:33Z

The problem with the gperftools profile was that it had no stack traces. I was running with IGPROF environment variables defined to run igprof under gdb. This prevented gperftools libprofile.so from recording stack traces.

gartung · 2022-05-18T18:09:00Z

The gperftools libprofile.so also hit a segfault in libunwind.

#4  <signal handler called>
#5  dwarf_readu64 (arg=<optimized out>, val=<synthetic pointer>, addr=0x7ffd40738188, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:128
#6  dwarf_readw (arg=<optimized out>, val=<synthetic pointer>, addr=0x7ffd40738188, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:280
#7  _ULx86_64_dwarf_eval_expr (c=c@entry=0x7ffd40738ba0, stack_val=stack_val@entry=0, addr=addr@entry=0x7ffd40738408, len=<optimized out>, valp=valp@entry=0x7ffd40738410, is_register=is_register@entry=0x7ffd40738400) at dwarf/Gexpr.c:432
#8  0x00007fa6414a4c50 in eval_location_expr (as=<optimized out>, a=<optimized out>, arg=<optimized out>, locp=<synthetic pointer>, addr=<optimized out>, stack_val=0, c=0x7ffd40738ba0) at dwarf/Gparser.c:756
#9  apply_reg_state (c=c@entry=0x7ffd40738ba0, rs=rs@entry=0x7ffd40738678) at dwarf/Gparser.c:826
#10 0x00007fa6414a525c in _ULx86_64_dwarf_step (c=c@entry=0x7ffd40738ba0) at dwarf/Gparser.c:981
#11 0x00007fa6414a1116 in _ULx86_64_step (cursor=cursor@entry=0x7ffd40738ba0) at x86_64/Gstep.c:75
#12 0x00007fa6428d2ae6 in GetStackTraceWithContext_libunwind (result=<optimized out>, max_depth=253, skip_count=1081313255, ucp=<optimized out>) at src/stacktrace_libunwind-inl.h:138
#13 0x00007fa6428d332f in GetStackTraceWithContext (result=0x7ffd40739018, max_depth=253, skip_count=3, uc=0x7ffd40739880) at src/stacktrace.cc:336
#14 0x00007fa6428cffa1 in CpuProfiler::prof_handler (signal_ucontext=<optimized out>, cpu_profiler=0x7fa6428db660, sig=<optimized out>) at src/profiler.cc:353
#15 0x00007fa6428d0973 in ProfileHandler::SignalHandler (sig=27, sinfo=0x7ffd407399b0, ucontext=0x7ffd40739880) at src/profile-handler.cc:545
#16 <signal handler called>
#17 0x00007fa618734d1a in mkfit::propagateHelixToZMPlex(Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, int, mkfit::PropagationFlags, Matriplex::Matriplex<int, 1, 1, 4> const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so

gartung · 2022-05-18T19:57:49Z

It looks like I was using an older libunwind with the original bug. I am trying gperftools with the patched libunwind and so far it hasn't crashed.

smuzaffar · 2022-05-19T08:03:15Z

@gartung , can you please also check why igprof fails for el8? Many ib-run-profiling jobs for el8_amd64_gcc10 are failing with message like [a]

[a]

08:33:05 Parsing igprof output file igprofMEM_step5.1.mp:  done
08:33:05 No counter values in profile data.Reading profile data.
08:33:05 Parsing igprof output file igprofMEM_step5.10.mp:  done
08:33:05 No counter values in profile data.Reading profile data.
08:33:05 Parsing igprof output file igprofMEM_step5.19.mp:  done
08:33:05 No counter values in profile data.Reading profile data.
08:33:05 Parsing igprof output file igprofMEM_step5.mp:  done
08:33:05 No counter values in profile data.+ ERR=1
08:33:05 + false
08:33:05 + exit 1

gartung · 2022-05-19T12:01:36Z

The memory profiling is run with igprof -mp -o ./igprofMEM_step3.mp -- cmsRunGlibC step3_igprof.py. The version of glibc changed with el8 so I suspect something has changed that is preventing igprof from recording memory allocation/deallocation.

gartung · 2022-05-19T12:57:05Z

These messages probably have something to do with it

igprof -d -mp -z -o step3_igprofMEM.gz -t cmsRunGlibC cmsRunGlibC step3_igprof.py 2>&1 | tee step3_igprof_mem.txt
*** IgProf(8897, 1652964911.131): profiler activated in cmsRunGlibC, main thread id 0x7fb41b706c00
*** IgProf(8897, 1652964911.131): profiler options: igprof:out='|gzip -c>step3_igprofMEM.gz' mem
*** IgProf(8897, 1652964911.131): exit (0x7fb41c205300) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): _exit (0x7fb41c2ae8c0) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): kill (0x7fb41c1d7460) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): pthread_create (0x7fb41c5814b0) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): dlsym(self, 'pthread_create'): cmsRunGlibC: undefined symbol: pthread_create, version GLIBC_2.0
*** IgProf(8897, 1652964911.131): dlsym(self, 'pthread_create'): cmsRunGlibC: undefined symbol: pthread_create, version GLIBC_2.1
*** IgProf(8897, 1652964911.131): memory profiler: reporting memory use without allocation overhead
*** IgProf(8897, 1652964911.131): malloc (0x7fb41c250660) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): calloc (0x7fb41c251420) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): realloc (0x7fb41c250f10) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): posix_memalign (0x7fb41c2520e0) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): memalign (0x7fb41c251360) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): pvalloc (0x7fb41c2513b0) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): valloc (0x7fb41c251370) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): free (0x7fb41c250cb0) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): dlsym(self, 'tc_new'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_new
*** IgProf(8897, 1652964911.149): findsym('tc_new', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.149): dlsym(self, 'tc_new_nothrow'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_new_nothrow
*** IgProf(8897, 1652964911.166): findsym('tc_new_nothrow', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.166): dlsym(self, 'tc_newarray'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_newarray
*** IgProf(8897, 1652964911.183): findsym('tc_newarray', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.183): dlsym(self, 'tc_newarray_nothrow'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_newarray_nothrow
*** IgProf(8897, 1652964911.201): findsym('tc_newarray_nothrow', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.201): dlsym(self, 'tc_malloc'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_malloc
*** IgProf(8897, 1652964911.218): findsym('tc_malloc', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.219): dlsym(self, 'tc_calloc'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_calloc
*** IgProf(8897, 1652964911.236): findsym('tc_calloc', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.236): dlsym(self, 'tc_realloc'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_realloc
*** IgProf(8897, 1652964911.253): findsym('tc_realloc', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.253): dlsym(self, 'tc_posix_memalign'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_posix_memalign
*** IgProf(8897, 1652964911.271): findsym('tc_posix_memalign', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.271): dlsym(self, 'tc_memalign'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_memalign
*** IgProf(8897, 1652964911.288): findsym('tc_memalign', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.289): dlsym(self, 'tc_pvalloc'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_pvalloc
*** IgProf(8897, 1652964911.306): findsym('tc_pvalloc', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.306): dlsym(self, 'tc_valloc'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_valloc
*** IgProf(8897, 1652964911.323): findsym('tc_valloc', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.324): dlsym(self, 'tc_free'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_free
*** IgProf(8897, 1652964911.341): findsym('tc_free', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.341): dlsym(self, 'tc_cfree'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_cfree
*** IgProf(8897, 1652964911.358): findsym('tc_cfree', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.358): dlsym(self, 'tc_delete'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_delete
*** IgProf(8897, 1652964911.376): findsym('tc_delete', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.376): dlsym(self, 'tc_delete_nothrow'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_delete_nothrow
*** IgProf(8897, 1652964911.393): findsym('tc_delete_nothrow', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.393): dlsym(self, 'tc_deletearray'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_deletearray
*** IgProf(8897, 1652964911.410): findsym('tc_deletearray', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.410): dlsym(self, 'tc_deletearray_nothrow'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_deletearray_nothrow
*** IgProf(8897, 1652964911.427): findsym('tc_deletearray_nothrow', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.428): dlsym(self, 'tc_delete_sized'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_delete_sized
*** IgProf(8897, 1652964911.445): findsym('tc_delete_sized', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.445): memory profiler enabled

gartung · 2022-05-19T13:01:34Z

@smuzaffar Can you disable the retries for now. Once is enough to verify that memory profiling is broken.

smuzaffar · 2022-05-19T13:09:14Z

ok , retires disbaled now

gartung · 2022-05-19T14:14:56Z

Using cmsRun instead of cmsRunGlibC produces a memory profile. @jpata you will need to update the release profiling script.

slava77 · 2022-05-19T14:45:15Z

Using cmsRun instead of cmsRunGlibC produces a memory profile. @jpata you will need to update the release profiling script.

this wasn't capable to record deallocations previously. Was it fixed?

gartung · 2022-05-19T15:30:43Z

How would I tell? I have memory profiles produced locally. What would tell you if the deallocations were recorded correctly?

slava77 · 2022-05-19T16:42:15Z

How would I tell? I have memory profiles produced locally. What would tell you if the deallocations were recorded correctly?

IIRC, MEM_LIVE will be closer to MEM_TOTAL (but perhaps it was just a factor of a few larger than the normal; I don't remember that well).

gartung · 2022-05-20T17:06:35Z

Using cmsRun on el8 and looking at MEM_LIVE and MEM_TOTAL on the second to last event shows the they are not close in value

==> IgProf.39.MEM_LIVE.txt <==
Counter: MEM_LIVE

----------------------------------------------------------------------
Flat profile (cumulative >= 1%)

% total           Total           Calls  Function
  100.0  20'367'236'321      19'805'715  <spontaneous> [1]
  100.0  20'367'234'873      19'805'708  __libc_start_main [2]
  100.0  20'367'234'873      19'805'708  _start [3]
  100.0  20'367'234'873      19'805'708  main [4]

==> IgProf.39.MEM_TOTAL.txt <==
Counter: MEM_TOTAL

----------------------------------------------------------------------
Flat profile (cumulative >= 1%)

% total            Total            Calls  Function
  100.0  185'828'883'715      968'688'276  <spontaneous> [1]
  100.0  185'828'803'507      968'685'669  main [2]
  100.0  185'828'802'307      968'685'619  __libc_start_main [3]
  100.0  185'828'801'323      968'685'578  _start [4]

slava77 · 2022-05-20T17:49:59Z

Using cmsRun on el8 and looking at MEM_LIVE and MEM_TOTAL on the second to last event shows the they are not close in value

Thanks for checking; apparently it's the other case from my attempt to recall it was just a factor of a few larger than the normal.

20 GB in IgProf.39.MEM_LIVE.txt looks wrong.
In case you are looking at the Run3 job, the value should be closer to 2GB for step3.

gartung · 2022-05-20T17:54:03Z

Yes this is for the Run3 job.

gartung · 2022-05-20T17:57:10Z

I am running a CPU profiling job with the same step3 and 400 events on el8 to see if it segfaults. Although the recent patch makes it occur less often.

gartung · 2022-05-20T18:04:51Z

Jemalloc, which I think is linked into cmsRun, has a profiling option
https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling

gartung · 2022-05-20T18:22:52Z

CPU profiling crashed around event 369 running on el8.

#4  <signal handler called>
#5  dwarf_readu64 (arg=<optimized out>, val=<synthetic pointer>, addr=0x7ffd4c594bc8, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:128
#6  dwarf_readw (arg=<optimized out>, val=<synthetic pointer>, addr=0x7ffd4c594bc8, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:280
#7  _ULx86_64_dwarf_eval_expr (c=c@entry=0x7ffd4c595a00, stack_val=stack_val@entry=0, addr=addr@entry=0x7ffd4c594e48, len=<optimized out>, valp=valp@entry=0x7ffd4c594e50, is_register=is_register@entry=0x7ffd4c594e40) at dwarf/Gexpr.c:432
#8  0x00007f601544dc50 in eval_location_expr (as=<optimized out>, a=<optimized out>, arg=<optimized out>, locp=<synthetic pointer>, addr=<optimized out>, stack_val=0, c=0x7ffd4c595a00) at dwarf/Gparser.c:756
#9  apply_reg_state (c=c@entry=0x7ffd4c595a00, rs=rs@entry=0x7ffd4c5950b8) at dwarf/Gparser.c:826
#10 0x00007f601544e25c in _ULx86_64_dwarf_step (c=c@entry=0x7ffd4c595a00) at dwarf/Gparser.c:981
#11 0x00007f601544a116 in _ULx86_64_step (cursor=cursor@entry=0x7ffd4c595a00) at x86_64/Gstep.c:75
#12 0x00007f60154483d9 in slow_backtrace (uc=0x7ffd4c595230, size=800, buffer=0x7ffd4c595e80) at mi/backtrace.c:45
#13 unw_backtrace (buffer=buffer@entry=0x7ffd4c595e80, size=size@entry=800) at mi/backtrace.c:72
#14 0x00007f6018324f35 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7ffd4c595e80, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre4-el8_amd64_gcc10/build/CMSSW_12_4_0_pre4-build/BUILD/el8_amd64_gcc10/external/igprof/5.9.16-b55509ef966dc5068aea62f260fb3883/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#15 0x00007f601832e4f8 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre4-el8_amd64_gcc10/build/CMSSW_12_4_0_pre4-build/BUILD/el8_amd64_gcc10/external/igprof/5.9.16-b55509ef966dc5068aea62f260fb3883/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007f5fe17de472 in mkfit::kalmanOperation(int, Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::MatriplexSym<float, 3, 4> const&, Matriplex::Matriplex<float, 3, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 1, 1, 4>&, int) () from /home/users/gartung/CMSSW_12_4_0_pre4/lib/el8_amd64_gcc10/libRecoTrackerMkFitCore.so

gartung · 2022-05-25T17:07:08Z

I found a fix for the memory profiling problem on el8. I will reverse this pull request once the Igprof fix goes in.
cms-externals/igprof#3

gartung · 2022-06-03T13:05:28Z

Still seeing the occasional segfault in cpu profiling

#4  <signal handler called>
#5  dwarf_readu64 (arg=<optimized out>, val=<synthetic pointer>, addr=0x7fffff393a08, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:128
#6  dwarf_readw (arg=<optimized out>, val=<synthetic pointer>, addr=0x7fffff393a08, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:280
#7  _ULx86_64_dwarf_eval_expr (c=c@entry=0x7fffff394840, stack_val=stack_val@entry=0, addr=addr@entry=0x7fffff393c88, len=<optimized out>, valp=valp@entry=0x7fffff393c90, is_register=is_register@entry=0x7fffff393c80) at dwarf/Gexpr.c:432
#8  0x00007f29bbbe1c50 in eval_location_expr (as=<optimized out>, a=<optimized out>, arg=<optimized out>, locp=<synthetic pointer>, addr=<optimized out>, stack_val=0, c=0x7fffff394840) at dwarf/Gparser.c:756
#9  apply_reg_state (c=c@entry=0x7fffff394840, rs=rs@entry=0x7fffff393ef8) at dwarf/Gparser.c:826
#10 0x00007f29bbbe225c in _ULx86_64_dwarf_step (c=c@entry=0x7fffff394840) at dwarf/Gparser.c:981
#11 0x00007f29bbbde116 in _ULx86_64_step (cursor=cursor@entry=0x7fffff394840) at x86_64/Gstep.c:75
#12 0x00007f29bbbdc3d9 in slow_backtrace (uc=0x7fffff394070, size=800, buffer=0x7fffff394cc0) at mi/backtrace.c:45
#13 unw_backtrace (buffer=buffer@entry=0x7fffff394cc0, size=size@entry=800) at mi/backtrace.c:72
#14 0x00007f29beab1f65 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7fffff394cc0, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc10/external/igprof/5.9.16-70108befbf35d571a521c8cd66ab22b6/igprof-fed6dc47309b008b2f2bbafd270604286de6ab54/src/walk-syms.cc:175
#15 0x00007f29beabb528 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc10/external/igprof/5.9.16-70108befbf35d571a521c8cd66ab22b6/igprof-fed6dc47309b008b2f2bbafd270604286de6ab54/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007f29851c5472 in mkfit::kalmanOperation(int, Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::MatriplexSym<float, 3, 4> const&, Matriplex::Matriplex<float, 3, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 1, 1, 4>&, int) () from /cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-01-2300/lib/el8_amd64_gcc10/libRecoTrackerMkFitCore.so

gartung · 2022-06-06T20:06:15Z

Updated libunwind to the head of master to further help with segfaults in the dwarf_* functions
cms-sw/cmsdist#7910

jpata · 2022-08-31T10:55:15Z

+reconstruction

recently, we haven't seen segfaults in the Run3 or Phase2 workflows in the reco profiling.
I think the issue is addressed in Update to HEAD of libunwind master. Patch is no longer needed. cmsdist#7910.

cmsbuild · 2022-08-31T10:55:33Z

This issue is fully signed and ready to be closed.

cmsbuild added pending-signatures reconstruction-pending labels May 5, 2022

gartung mentioned this issue May 10, 2022

Update to head of 1.6 branch to pick up fix for segfault in gdwarf. cms-sw/cmsdist#7853

Merged

smuzaffar closed this as completed in cms-sw/cmsdist#7853 May 11, 2022

smuzaffar reopened this May 18, 2022

slava77 mentioned this issue May 19, 2022

TrackTfClassifier with Batch evaluation of the DNN #38007

Merged

jpata closed this as completed Aug 31, 2022

cmsbuild added fully-signed reconstruction-approved and removed reconstruction-pending pending-signatures labels Aug 31, 2022

igprof pp segfault in 12_3_0, 12_3_0_pre4 in the Run3 reco step #37816

igprof pp segfault in 12_3_0, 12_3_0_pre4 in the Run3 reco step #37816

Comments

jpata commented May 5, 2022

jpata commented May 5, 2022

cmsbuild commented May 5, 2022

cmsbuild commented May 5, 2022

slava77 commented May 5, 2022

jpata commented May 5, 2022

slava77 commented May 5, 2022

gartung commented May 5, 2022

gartung commented May 5, 2022

gartung commented May 5, 2022

jpata commented May 6, 2022

gartung commented May 6, 2022

gartung commented May 6, 2022

gartung commented May 6, 2022

jpata commented May 10, 2022

gartung commented May 10, 2022

gartung commented May 10, 2022

gartung commented May 10, 2022

gartung commented May 10, 2022

gartung commented May 10, 2022

gartung commented May 10, 2022

gartung commented May 10, 2022

jpata commented May 17, 2022

gartung commented May 17, 2022

smuzaffar commented May 17, 2022

gartung commented May 17, 2022

gartung commented May 17, 2022

gartung commented May 18, 2022

gartung commented May 18, 2022 • edited

gartung commented May 18, 2022

smuzaffar commented May 19, 2022

gartung commented May 19, 2022 • edited

gartung commented May 19, 2022

gartung commented May 19, 2022

smuzaffar commented May 19, 2022 • edited

gartung commented May 19, 2022

slava77 commented May 19, 2022

gartung commented May 19, 2022

slava77 commented May 19, 2022

gartung commented May 20, 2022

slava77 commented May 20, 2022

gartung commented May 20, 2022

gartung commented May 20, 2022

gartung commented May 20, 2022

gartung commented May 20, 2022

gartung commented May 25, 2022

gartung commented Jun 3, 2022 • edited

gartung commented Jun 6, 2022

jpata commented Aug 31, 2022

cmsbuild commented Aug 31, 2022

gartung commented May 18, 2022 •

edited

gartung commented May 19, 2022 •

edited

smuzaffar commented May 19, 2022 •

edited

gartung commented Jun 3, 2022 •

edited