Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

igprof pp segfault in 12_3_0, 12_3_0_pre4 in the Run3 reco step #37816

Closed
jpata opened this issue May 5, 2022 · 60 comments · Fixed by cms-sw/cmsdist#7853
Closed

igprof pp segfault in 12_3_0, 12_3_0_pre4 in the Run3 reco step #37816

jpata opened this issue May 5, 2022 · 60 comments · Fixed by cms-sw/cmsdist#7853

Comments

@jpata
Copy link
Contributor

jpata commented May 5, 2022

In two recent releases, igprof pp crashes in the reco step in 11834.21:

$ tail -n10 /eos/cms/store/user/cmsbuild/profiling/data/CMSSW_12_3_0/slc7_amd64_gcc10/11834.21/step3_igprof_cpu.txt
#14 0x00007f36c6833f45 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7ffc58699700, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-f8a2b39c36d2a318d6c7c0f619242bdb/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#15 0x00007f36c683d508 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-f8a2b39c36d2a318d6c7c0f619242bdb/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007f368f898ee2 in mkfit::kalmanOperation(int, Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::MatriplexSym<float, 3, 4> const&, Matriplex::Matriplex<float, 3, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 1, 1, 4>&, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so

Current Modules:

Module: MkFitProducer:detachedTripletStepTrackCandidatesMkFit (crashed)

A fatal system signal has occurred: segmentation violation
$ tail -n10 /eos/cms/store/user/cmsbuild/profiling/data/CMSSW_12_4_0_pre3/slc7_amd64_gcc10/11834.21/step3_igprof_cpu.txt
#14 0x00007f9899389f45 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7ffd8674d480, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-95dc8f7dd3ee3d76c20fd25518fc6fa9/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#15 0x00007f9899393508 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-95dc8f7dd3ee3d76c20fd25518fc6fa9/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007f9860bcf8ea in mkfit::propagateHelixToZMPlex(Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, int, mkfit::PropagationFlags, Matriplex::Matriplex<int, 1, 1, 4> const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre3/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so

Current Modules:

Module: MkFitProducer:highPtTripletStepTrackCandidatesMkFit (crashed)

A fatal system signal has occurred: segmentation violation

In both cases, the current module is MkFitProducer. Is it a coincidence, or do we have a regression?

Note that igprof mp does not crash in these workflows, and the crash happens around event 230-260. Since jenkins tries to run igprof several times in case of failure, it looks like it's reproducible.

@slava77 @gartung

@jpata
Copy link
Contributor Author

jpata commented May 5, 2022

assign reconstruction

@cmsbuild
Copy link
Contributor

cmsbuild commented May 5, 2022

New categories assigned: reconstruction

@jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

cmsbuild commented May 5, 2022

A new Issue was created by @jpata Joosep Pata.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@slava77
Copy link
Contributor

slava77 commented May 5, 2022

@jpata
does this happen on the same event and in the same module?

@jpata
Copy link
Contributor Author

jpata commented May 5, 2022

It's not the same event (but roughly ~230...260 events in) in different releases, nor the same module (highPtTripletStepTrackCandidatesMkFit vs. detachedTripletStepTrackCandidatesMkFit).

@slava77
Copy link
Contributor

slava77 commented May 5, 2022

It's not the same event (but roughly ~230...260 events in) in different releases, nor the same module (highPtTripletStepTrackCandidatesMkFit vs. detachedTripletStepTrackCandidatesMkFit).

there were no recent updates in the propagate or kalman update routines.
This still seems similar to the previous case where the issue was with the profiler itself having some outdated (was it TBB or pthread?) dependencies.

@gartung
Copy link
Member

gartung commented May 5, 2022

It could be some new TBB thread behavior. I could try running the work flow with gpertools cpu profiling enabled to see if has the same segfault.

Is this related to the segfault in propagateHelixToZMPlex when fast-math optimizations are enabled?

@gartung
Copy link
Member

gartung commented May 5, 2022

This is the fix that Steve Lantz suggested

-
-  void propagateHelixToZMPlex(const MPlexLS& inErr,
+  
+  void __attribute__ ((optimize("no-inline"))) propagateHelixToZMPlex(const MPlexLS& inErr,

@gartung
Copy link
Member

gartung commented May 5, 2022

The crash is caused by the () operator being optimized out.

@jpata
Copy link
Contributor Author

jpata commented May 6, 2022

@gartung do we know why the inlining / optimization of the code being profiled causes the profiler to crash?
I suppose ideally the fix should be on the profiler side, rather than changing what I understand is valid code on the CMSSW side.

@gartung
Copy link
Member

gartung commented May 6, 2022

I will run the profiler with debugging statements enabled to see if that tells me anything. The last segfault was "fixed" by updating libunwind.

@gartung
Copy link
Member

gartung commented May 6, 2022

I will also run the gpertools profiler to see if it segfaults.

@gartung
Copy link
Member

gartung commented May 6, 2022

First run through with the same file did not produce a segfault.

@jpata
Copy link
Contributor Author

jpata commented May 10, 2022

It segfaults reliably about halfway through the 500 events if you don't use gdb or don't recompile with debugging symbols, I just tried again on vocms011. Is it possible that gdb or debug symbols make the crash go away?

@gartung
Copy link
Member

gartung commented May 10, 2022

is the job run inside a Singularity image?

@gartung
Copy link
Member

gartung commented May 10, 2022

Nevermind. I did get a crash on the second run without gdb that I started Friday evening.

@gartung
Copy link
Member

gartung commented May 10, 2022

The segfault is in libunwind or libdwarf. I am attempting catch the segault in gdb.

@gartung
Copy link
Member

gartung commented May 10, 2022

I am trying this patch
libunwind/libunwind#348

@gartung
Copy link
Member

gartung commented May 10, 2022

Patch was incorporated into head of 1.6-stable branch in Dec2021. I am updating the libunwind spec to use the latest commit on that branch. I will test with the updated libunwind.

@gartung
Copy link
Member

gartung commented May 10, 2022

Captured backtrace in debugger

_ULx86_64_dwarf_eval_expr (c=c@entry=0x7fffffff03c0, stack_val=stack_val@entry=0, addr=addr@entry=0x7ffffffef828, len=<optimized out>, 
    valp=valp@entry=0x7ffffffef830, is_register=is_register@entry=0x7ffffffef820) at dwarf/Gexpr.c:432
432     dwarf/Gexpr.c: No such file or directory.
(gdb) bt
#0  _ULx86_64_dwarf_eval_expr (c=c@entry=0x7fffffff03c0, stack_val=stack_val@entry=0, addr=addr@entry=0x7ffffffef828, len=<optimized out>, 
    valp=valp@entry=0x7ffffffef830, is_register=is_register@entry=0x7ffffffef820) at dwarf/Gexpr.c:432
#1  0x00007ffff517ac48 in eval_location_expr (as=<optimized out>, a=<optimized out>, arg=<optimized out>, locp=<synthetic pointer>, 
    addr=<optimized out>, stack_val=0, c=0x7fffffff03c0) at dwarf/Gparser.c:756
#2  apply_reg_state (c=c@entry=0x7fffffff03c0, rs=rs@entry=0x7ffffffefa98) at dwarf/Gparser.c:817
#3  0x00007ffff517b249 in _ULx86_64_dwarf_step (c=c@entry=0x7fffffff03c0) at dwarf/Gparser.c:974
#4  0x00007ffff5177116 in _ULx86_64_step (cursor=cursor@entry=0x7fffffff03c0) at x86_64/Gstep.c:75
#5  0x00007ffff51753d9 in slow_backtrace (uc=0x7ffffffefc10, size=800, buffer=0x7fffffff0840) at mi/backtrace.c:45
#6  unw_backtrace (buffer=buffer@entry=0x7fffffff0840, size=size@entry=800) at mi/backtrace.c:72
#7  0x00007ffff7fdaf45 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7fffffff0840, nmax=nmax@entry=800)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-95dc8f7dd3ee3d76c20fd25518fc6fa9/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#8  0x00007ffff7fe4508 in profileSignalHandler ()
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-95dc8f7dd3ee3d76c20fd25518fc6fa9/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#9  <signal handler called>
#10 0x00007fffbf3e8eae in mkfit::propagateHelixToRMPlex(Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, int, mkfit::PropagationFlags, Matriplex::Matriplex<int, 1, 1, 4> const*) ()
   from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre3/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so
Backtrace stopped: Cannot access memory at address 0x3ae

@gartung
Copy link
Member

gartung commented May 10, 2022

Looks like this bug might be addressed by updating libunwind.

@jpata
Copy link
Contributor Author

jpata commented May 17, 2022

Looks like it still crashed in 12_4_0_pre4: /eos/cms/store/user/cmsbuild/profiling/data/CMSSW_12_4_0_pre4/slc7_amd64_gcc10/11834.21/step3_igprof_cpu.txt

Begin processing the 183rd record. Run 1, Event 183, LumiSection 1 on stream 0 at 16-May-2022 22:13:58.719 CEST
%MSG-e TrajectoryNotPosDef:  TrackProducer:detachedTripletStepTracks  16-May-2022 22:14:02 CEST Run: 1 Event: 183
Trajectory covariance is not positive-definite
%MSG
...
#9  apply_reg_state (c=c@entry=0x7fff9609eb40, rs=rs@entry=0x7fff9609e218) at dwarf/Gparser.c:826
#10 0x00007efe34c6d25c in _ULx86_64_dwarf_step (c=c@entry=0x7fff9609eb40) at dwarf/Gparser.c:981
#11 0x00007efe34c69116 in _ULx86_64_step (cursor=cursor@entry=0x7fff9609eb40) at x86_64/Gstep.c:75
#12 0x00007efe34c673d9 in slow_backtrace (uc=0x7fff9609e390, size=800, buffer=0x7fff9609efc0) at mi/backtrace.c:45
#13 unw_backtrace (buffer=buffer@entry=0x7fff9609efc0, size=size@entry=800) at mi/backtrace.c:72
#14 0x00007efe37ad3f45 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7fff9609efc0, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre4-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-b55509ef966dc5068aea62f260fb3883/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#15 0x00007efe37add508 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre4-build/BUILD/slc7_amd64_gcc10/external/igprof/5.9.16-b55509ef966dc5068aea62f260fb3883/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007efdff9abd1a in mkfit::propagateHelixToZMPlex(Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, int, mkfit::PropagationFlags, Matriplex::Matriplex<int, 1, 1, 4> const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so

Current Modules:

Module: MkFitProducer:pixelLessStepTrackCandidatesMkFit (crashed)

@gartung
Copy link
Member

gartung commented May 17, 2022

@smuzaffar has the update to libunwind/igprof been incorporated into the cmssw toolset?

@smuzaffar
Copy link
Contributor

yes @gartung , cms-sw/cmsdist#7853 has been integrated

@gartung
Copy link
Member

gartung commented May 17, 2022

@jpata can you please verify what igprof and libunwind are being used in the profiling job.

@gartung
Copy link
Member

gartung commented May 17, 2022

@smuzaffar did the updates make it into the CMSSW_12_4_0_pre4 toolset?

@smuzaffar smuzaffar reopened this May 18, 2022
@gartung
Copy link
Member

gartung commented May 18, 2022

The problem with the gperftools profile was that it had no stack traces. I was running with IGPROF environment variables defined to run igprof under gdb. This prevented gperftools libprofile.so from recording stack traces.

@gartung
Copy link
Member

gartung commented May 18, 2022

The gperftools libprofile.so also hit a segfault in libunwind.

#4  <signal handler called>
#5  dwarf_readu64 (arg=<optimized out>, val=<synthetic pointer>, addr=0x7ffd40738188, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:128
#6  dwarf_readw (arg=<optimized out>, val=<synthetic pointer>, addr=0x7ffd40738188, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:280
#7  _ULx86_64_dwarf_eval_expr (c=c@entry=0x7ffd40738ba0, stack_val=stack_val@entry=0, addr=addr@entry=0x7ffd40738408, len=<optimized out>, valp=valp@entry=0x7ffd40738410, is_register=is_register@entry=0x7ffd40738400) at dwarf/Gexpr.c:432
#8  0x00007fa6414a4c50 in eval_location_expr (as=<optimized out>, a=<optimized out>, arg=<optimized out>, locp=<synthetic pointer>, addr=<optimized out>, stack_val=0, c=0x7ffd40738ba0) at dwarf/Gparser.c:756
#9  apply_reg_state (c=c@entry=0x7ffd40738ba0, rs=rs@entry=0x7ffd40738678) at dwarf/Gparser.c:826
#10 0x00007fa6414a525c in _ULx86_64_dwarf_step (c=c@entry=0x7ffd40738ba0) at dwarf/Gparser.c:981
#11 0x00007fa6414a1116 in _ULx86_64_step (cursor=cursor@entry=0x7ffd40738ba0) at x86_64/Gstep.c:75
#12 0x00007fa6428d2ae6 in GetStackTraceWithContext_libunwind (result=<optimized out>, max_depth=253, skip_count=1081313255, ucp=<optimized out>) at src/stacktrace_libunwind-inl.h:138
#13 0x00007fa6428d332f in GetStackTraceWithContext (result=0x7ffd40739018, max_depth=253, skip_count=3, uc=0x7ffd40739880) at src/stacktrace.cc:336
#14 0x00007fa6428cffa1 in CpuProfiler::prof_handler (signal_ucontext=<optimized out>, cpu_profiler=0x7fa6428db660, sig=<optimized out>) at src/profiler.cc:353
#15 0x00007fa6428d0973 in ProfileHandler::SignalHandler (sig=27, sinfo=0x7ffd407399b0, ucontext=0x7ffd40739880) at src/profile-handler.cc:545
#16 <signal handler called>
#17 0x00007fa618734d1a in mkfit::propagateHelixToZMPlex(Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, int, mkfit::PropagationFlags, Matriplex::Matriplex<int, 1, 1, 4> const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so

@gartung
Copy link
Member

gartung commented May 18, 2022

It looks like I was using an older libunwind with the original bug. I am trying gperftools with the patched libunwind and so far it hasn't crashed.

@smuzaffar
Copy link
Contributor

@gartung , can you please also check why igprof fails for el8? Many ib-run-profiling jobs for el8_amd64_gcc10 are failing with message like [a]

[a]

08:33:05 Parsing igprof output file igprofMEM_step5.1.mp:  done
08:33:05 No counter values in profile data.Reading profile data.
08:33:05 Parsing igprof output file igprofMEM_step5.10.mp:  done
08:33:05 No counter values in profile data.Reading profile data.
08:33:05 Parsing igprof output file igprofMEM_step5.19.mp:  done
08:33:05 No counter values in profile data.Reading profile data.
08:33:05 Parsing igprof output file igprofMEM_step5.mp:  done
08:33:05 No counter values in profile data.+ ERR=1
08:33:05 + false
08:33:05 + exit 1

@gartung
Copy link
Member

gartung commented May 19, 2022

The memory profiling is run with igprof -mp -o ./igprofMEM_step3.mp -- cmsRunGlibC step3_igprof.py. The version of glibc changed with el8 so I suspect something has changed that is preventing igprof from recording memory allocation/deallocation.

@gartung
Copy link
Member

gartung commented May 19, 2022

These messages probably have something to do with it

igprof -d -mp -z -o step3_igprofMEM.gz -t cmsRunGlibC cmsRunGlibC step3_igprof.py 2>&1 | tee step3_igprof_mem.txt
*** IgProf(8897, 1652964911.131): profiler activated in cmsRunGlibC, main thread id 0x7fb41b706c00
*** IgProf(8897, 1652964911.131): profiler options: igprof:out='|gzip -c>step3_igprofMEM.gz' mem
*** IgProf(8897, 1652964911.131): exit (0x7fb41c205300) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): _exit (0x7fb41c2ae8c0) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): kill (0x7fb41c1d7460) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): pthread_create (0x7fb41c5814b0) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): dlsym(self, 'pthread_create'): cmsRunGlibC: undefined symbol: pthread_create, version GLIBC_2.0
*** IgProf(8897, 1652964911.131): dlsym(self, 'pthread_create'): cmsRunGlibC: undefined symbol: pthread_create, version GLIBC_2.1
*** IgProf(8897, 1652964911.131): memory profiler: reporting memory use without allocation overhead
*** IgProf(8897, 1652964911.131): malloc (0x7fb41c250660) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): calloc (0x7fb41c251420) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): realloc (0x7fb41c250f10) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): posix_memalign (0x7fb41c2520e0) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): memalign (0x7fb41c251360) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): pvalloc (0x7fb41c2513b0) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): valloc (0x7fb41c251370) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): free (0x7fb41c250cb0) + 0x0: unrecognised prologue (found 0xf3 0xf 0x1e 0xfa)
*** IgProf(8897, 1652964911.131): dlsym(self, 'tc_new'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_new
*** IgProf(8897, 1652964911.149): findsym('tc_new', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.149): dlsym(self, 'tc_new_nothrow'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_new_nothrow
*** IgProf(8897, 1652964911.166): findsym('tc_new_nothrow', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.166): dlsym(self, 'tc_newarray'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_newarray
*** IgProf(8897, 1652964911.183): findsym('tc_newarray', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.183): dlsym(self, 'tc_newarray_nothrow'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_newarray_nothrow
*** IgProf(8897, 1652964911.201): findsym('tc_newarray_nothrow', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.201): dlsym(self, 'tc_malloc'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_malloc
*** IgProf(8897, 1652964911.218): findsym('tc_malloc', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.219): dlsym(self, 'tc_calloc'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_calloc
*** IgProf(8897, 1652964911.236): findsym('tc_calloc', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.236): dlsym(self, 'tc_realloc'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_realloc
*** IgProf(8897, 1652964911.253): findsym('tc_realloc', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.253): dlsym(self, 'tc_posix_memalign'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_posix_memalign
*** IgProf(8897, 1652964911.271): findsym('tc_posix_memalign', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.271): dlsym(self, 'tc_memalign'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_memalign
*** IgProf(8897, 1652964911.288): findsym('tc_memalign', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.289): dlsym(self, 'tc_pvalloc'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_pvalloc
*** IgProf(8897, 1652964911.306): findsym('tc_pvalloc', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.306): dlsym(self, 'tc_valloc'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_valloc
*** IgProf(8897, 1652964911.323): findsym('tc_valloc', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.324): dlsym(self, 'tc_free'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_free
*** IgProf(8897, 1652964911.341): findsym('tc_free', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.341): dlsym(self, 'tc_cfree'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_cfree
*** IgProf(8897, 1652964911.358): findsym('tc_cfree', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.358): dlsym(self, 'tc_delete'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_delete
*** IgProf(8897, 1652964911.376): findsym('tc_delete', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.376): dlsym(self, 'tc_delete_nothrow'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_delete_nothrow
*** IgProf(8897, 1652964911.393): findsym('tc_delete_nothrow', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.393): dlsym(self, 'tc_deletearray'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_deletearray
*** IgProf(8897, 1652964911.410): findsym('tc_deletearray', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.410): dlsym(self, 'tc_deletearray_nothrow'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_deletearray_nothrow
*** IgProf(8897, 1652964911.427): findsym('tc_deletearray_nothrow', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.428): dlsym(self, 'tc_delete_sized'): /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_0_pre4/external/el8_amd64_gcc10/lib/libigprof.so: undefined symbol: tc_delete_sized
*** IgProf(8897, 1652964911.445): findsym('tc_delete_sized', '(null)', 4): symbol not found
*** IgProf(8897, 1652964911.445): memory profiler enabled

@gartung
Copy link
Member

gartung commented May 19, 2022

@smuzaffar Can you disable the retries for now. Once is enough to verify that memory profiling is broken.

@smuzaffar
Copy link
Contributor

smuzaffar commented May 19, 2022

ok , retires disbaled now

@gartung
Copy link
Member

gartung commented May 19, 2022

Using cmsRun instead of cmsRunGlibC produces a memory profile. @jpata you will need to update the release profiling script.

@slava77
Copy link
Contributor

slava77 commented May 19, 2022

Using cmsRun instead of cmsRunGlibC produces a memory profile. @jpata you will need to update the release profiling script.

this wasn't capable to record deallocations previously. Was it fixed?

@gartung
Copy link
Member

gartung commented May 19, 2022

How would I tell? I have memory profiles produced locally. What would tell you if the deallocations were recorded correctly?

@slava77
Copy link
Contributor

slava77 commented May 19, 2022

How would I tell? I have memory profiles produced locally. What would tell you if the deallocations were recorded correctly?

IIRC, MEM_LIVE will be closer to MEM_TOTAL (but perhaps it was just a factor of a few larger than the normal; I don't remember that well).

@gartung
Copy link
Member

gartung commented May 20, 2022

Using cmsRun on el8 and looking at MEM_LIVE and MEM_TOTAL on the second to last event shows the they are not close in value

==> IgProf.39.MEM_LIVE.txt <==
Counter: MEM_LIVE

----------------------------------------------------------------------
Flat profile (cumulative >= 1%)

% total           Total           Calls  Function
  100.0  20'367'236'321      19'805'715  <spontaneous> [1]
  100.0  20'367'234'873      19'805'708  __libc_start_main [2]
  100.0  20'367'234'873      19'805'708  _start [3]
  100.0  20'367'234'873      19'805'708  main [4]

==> IgProf.39.MEM_TOTAL.txt <==
Counter: MEM_TOTAL

----------------------------------------------------------------------
Flat profile (cumulative >= 1%)

% total            Total            Calls  Function
  100.0  185'828'883'715      968'688'276  <spontaneous> [1]
  100.0  185'828'803'507      968'685'669  main [2]
  100.0  185'828'802'307      968'685'619  __libc_start_main [3]
  100.0  185'828'801'323      968'685'578  _start [4]

@slava77
Copy link
Contributor

slava77 commented May 20, 2022

Using cmsRun on el8 and looking at MEM_LIVE and MEM_TOTAL on the second to last event shows the they are not close in value

Thanks for checking; apparently it's the other case from my attempt to recall it was just a factor of a few larger than the normal.

20 GB in IgProf.39.MEM_LIVE.txt looks wrong.
In case you are looking at the Run3 job, the value should be closer to 2GB for step3.

@gartung
Copy link
Member

gartung commented May 20, 2022

Yes this is for the Run3 job.

@gartung
Copy link
Member

gartung commented May 20, 2022

I am running a CPU profiling job with the same step3 and 400 events on el8 to see if it segfaults. Although the recent patch makes it occur less often.

@gartung
Copy link
Member

gartung commented May 20, 2022

Jemalloc, which I think is linked into cmsRun, has a profiling option
https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling

@gartung
Copy link
Member

gartung commented May 20, 2022

CPU profiling crashed around event 369 running on el8.

#4  <signal handler called>
#5  dwarf_readu64 (arg=<optimized out>, val=<synthetic pointer>, addr=0x7ffd4c594bc8, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:128
#6  dwarf_readw (arg=<optimized out>, val=<synthetic pointer>, addr=0x7ffd4c594bc8, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:280
#7  _ULx86_64_dwarf_eval_expr (c=c@entry=0x7ffd4c595a00, stack_val=stack_val@entry=0, addr=addr@entry=0x7ffd4c594e48, len=<optimized out>, valp=valp@entry=0x7ffd4c594e50, is_register=is_register@entry=0x7ffd4c594e40) at dwarf/Gexpr.c:432
#8  0x00007f601544dc50 in eval_location_expr (as=<optimized out>, a=<optimized out>, arg=<optimized out>, locp=<synthetic pointer>, addr=<optimized out>, stack_val=0, c=0x7ffd4c595a00) at dwarf/Gparser.c:756
#9  apply_reg_state (c=c@entry=0x7ffd4c595a00, rs=rs@entry=0x7ffd4c5950b8) at dwarf/Gparser.c:826
#10 0x00007f601544e25c in _ULx86_64_dwarf_step (c=c@entry=0x7ffd4c595a00) at dwarf/Gparser.c:981
#11 0x00007f601544a116 in _ULx86_64_step (cursor=cursor@entry=0x7ffd4c595a00) at x86_64/Gstep.c:75
#12 0x00007f60154483d9 in slow_backtrace (uc=0x7ffd4c595230, size=800, buffer=0x7ffd4c595e80) at mi/backtrace.c:45
#13 unw_backtrace (buffer=buffer@entry=0x7ffd4c595e80, size=size@entry=800) at mi/backtrace.c:72
#14 0x00007f6018324f35 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7ffd4c595e80, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre4-el8_amd64_gcc10/build/CMSSW_12_4_0_pre4-build/BUILD/el8_amd64_gcc10/external/igprof/5.9.16-b55509ef966dc5068aea62f260fb3883/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/walk-syms.cc:175
#15 0x00007f601832e4f8 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre4-el8_amd64_gcc10/build/CMSSW_12_4_0_pre4-build/BUILD/el8_amd64_gcc10/external/igprof/5.9.16-b55509ef966dc5068aea62f260fb3883/igprof-6cc73b59d83ed6c9d73b455dc40857e700ef6ee4/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007f5fe17de472 in mkfit::kalmanOperation(int, Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::MatriplexSym<float, 3, 4> const&, Matriplex::Matriplex<float, 3, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 1, 1, 4>&, int) () from /home/users/gartung/CMSSW_12_4_0_pre4/lib/el8_amd64_gcc10/libRecoTrackerMkFitCore.so

@gartung
Copy link
Member

gartung commented May 25, 2022

I found a fix for the memory profiling problem on el8. I will reverse this pull request once the Igprof fix goes in.
cms-externals/igprof#3

@gartung
Copy link
Member

gartung commented Jun 3, 2022

Still seeing the occasional segfault in cpu profiling

#4  <signal handler called>
#5  dwarf_readu64 (arg=<optimized out>, val=<synthetic pointer>, addr=0x7fffff393a08, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:128
#6  dwarf_readw (arg=<optimized out>, val=<synthetic pointer>, addr=0x7fffff393a08, a=<optimized out>, as=<optimized out>) at ../include/dwarf_i.h:280
#7  _ULx86_64_dwarf_eval_expr (c=c@entry=0x7fffff394840, stack_val=stack_val@entry=0, addr=addr@entry=0x7fffff393c88, len=<optimized out>, valp=valp@entry=0x7fffff393c90, is_register=is_register@entry=0x7fffff393c80) at dwarf/Gexpr.c:432
#8  0x00007f29bbbe1c50 in eval_location_expr (as=<optimized out>, a=<optimized out>, arg=<optimized out>, locp=<synthetic pointer>, addr=<optimized out>, stack_val=0, c=0x7fffff394840) at dwarf/Gparser.c:756
#9  apply_reg_state (c=c@entry=0x7fffff394840, rs=rs@entry=0x7fffff393ef8) at dwarf/Gparser.c:826
#10 0x00007f29bbbe225c in _ULx86_64_dwarf_step (c=c@entry=0x7fffff394840) at dwarf/Gparser.c:981
#11 0x00007f29bbbde116 in _ULx86_64_step (cursor=cursor@entry=0x7fffff394840) at x86_64/Gstep.c:75
#12 0x00007f29bbbdc3d9 in slow_backtrace (uc=0x7fffff394070, size=800, buffer=0x7fffff394cc0) at mi/backtrace.c:45
#13 unw_backtrace (buffer=buffer@entry=0x7fffff394cc0, size=size@entry=800) at mi/backtrace.c:72
#14 0x00007f29beab1f65 in IgHookTrace::stacktrace (addresses=addresses@entry=0x7fffff394cc0, nmax=nmax@entry=800) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc10/external/igprof/5.9.16-70108befbf35d571a521c8cd66ab22b6/igprof-fed6dc47309b008b2f2bbafd270604286de6ab54/src/walk-syms.cc:175
#15 0x00007f29beabb528 in profileSignalHandler () at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc10/external/igprof/5.9.16-70108befbf35d571a521c8cd66ab22b6/igprof-fed6dc47309b008b2f2bbafd270604286de6ab54/src/profile-perf.cc:66
#16 <signal handler called>
#17 0x00007f29851c5472 in mkfit::kalmanOperation(int, Matriplex::MatriplexSym<float, 6, 4> const&, Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::MatriplexSym<float, 3, 4> const&, Matriplex::Matriplex<float, 3, 1, 4> const&, Matriplex::MatriplexSym<float, 6, 4>&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 1, 1, 4>&, int) () from /cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-06-01-2300/lib/el8_amd64_gcc10/libRecoTrackerMkFitCore.so

@gartung
Copy link
Member

gartung commented Jun 6, 2022

Updated libunwind to the head of master to further help with segfaults in the dwarf_* functions
cms-sw/cmsdist#7910

@jpata
Copy link
Contributor Author

jpata commented Aug 31, 2022

+reconstruction

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants