igprof pp segfault in 12_3_0, 12_3_0_pre4 in the Run3 reco step #37816
Comments
assign reconstruction |
New categories assigned: reconstruction @jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A new Issue was created by @jpata Joosep Pata. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
@jpata |
It's not the same event (but roughly ~230...260 events in) in different releases, nor the same module (highPtTripletStepTrackCandidatesMkFit vs. detachedTripletStepTrackCandidatesMkFit). |
there were no recent updates in the propagate or kalman update routines. |
It could be some new TBB thread behavior. I could try running the workflow with gperftools CPU profiling enabled to see if it has the same segfault. Is this related to the segfault in propagateHelixToZMPlex when fast-math optimizations are enabled? |
This is the fix that Steve Lantz suggested
|
The crash is caused by the () operator being optimized out. |
@gartung do we know why the inlining / optimization of the code being profiled causes the profiler to crash? |
I will run the profiler with debugging statements enabled to see if that tells me anything. The last segfault was "fixed" by updating libunwind. |
I will also run the gperftools profiler to see if it segfaults. |
First run through with the same file did not produce a segfault. |
It segfaults reliably about halfway through the 500 events if you don't use gdb or don't recompile with debugging symbols; I just tried again on vocms011. Is it possible that gdb or debug symbols make the crash go away? |
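One way to put a number on "reliably" is to repeat the job and count how many runs die with SIGSEGV. The following is a minimal sketch, not the script used in this thread: the helper names `classify_exit` and `count_segfaults` are hypothetical, and the command passed in is a placeholder for whatever profiling command line is already in use. It relies only on the Python `subprocess` convention that a negative return code means the child was killed by that signal.

```python
import signal
import subprocess
import sys


def classify_exit(returncode):
    """Return 'ok', 'exit N', or the signal name for a subprocess return code."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def count_segfaults(command, runs=3):
    """Run `command` (an argv list) `runs` times; return (segfaults, outcomes)."""
    outcomes = []
    for _ in range(runs):
        result = subprocess.run(command, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        outcomes.append(classify_exit(result.returncode))
    return outcomes.count("SIGSEGV"), outcomes


if __name__ == "__main__":
    # Placeholder command: a trivial Python child process that exits cleanly.
    segfaults, outcomes = count_segfaults([sys.executable, "-c", "pass"], runs=2)
    print(segfaults, outcomes)
```

If the job is launched through a shell wrapper rather than directly, a segfault in the child typically surfaces as exit status 139 instead of a negative return code, so the classification would need adjusting for that case.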
is the job run inside a Singularity image? |
Never mind. I did get a crash on the second run without gdb that I started Friday evening. |
The segfault is in libunwind or libdwarf. I am attempting to catch the segfault in gdb. |
I am trying this patch |
The patch was incorporated into the head of the 1.6-stable branch in December 2021. I am updating the libunwind spec to use the latest commit on that branch. I will test with the updated libunwind. |
Captured backtrace in debugger
|
Looks like this bug might be addressed by updating libunwind. |
Looks like it still crashed in 12_4_0_pre4: /eos/cms/store/user/cmsbuild/profiling/data/CMSSW_12_4_0_pre4/slc7_amd64_gcc10/11834.21/step3_igprof_cpu.txt
|
@smuzaffar has the update to libunwind/igprof been incorporated into the cmssw toolset? |
yes @gartung , cms-sw/cmsdist#7853 has been integrated |
@jpata can you please verify what igprof and libunwind are being used in the profiling job. |
@smuzaffar did the updates make it into the CMSSW_12_4_0_pre4 toolset? |
The problem with the gperftools profile was that it had no stack traces. I was running with IGPROF environment variables defined to run igprof under gdb. This prevented gperftools libprofile.so from recording stack traces. |
The gperftools libprofile.so also hit a segfault in libunwind.
|
It looks like I was using an older libunwind with the original bug. I am trying gperftools with the patched libunwind and so far it hasn't crashed. |
@gartung , can you please also check why igprof fails for el8? Many ib-run-profiling jobs for [a]
|
The memory profiling is run with |
These messages probably have something to do with it
|
@smuzaffar Can you disable the retries for now? Once is enough to verify that memory profiling is broken. |
OK, retries disabled now |
Using cmsRun instead of cmsRunGlibC produces a memory profile. @jpata you will need to update the release profiling script. |
This wasn't able to record deallocations previously. Was it fixed? |
How would I tell? I have memory profiles produced locally. What would tell you if the deallocations were recorded correctly? |
IIRC, MEM_LIVE will be closer to MEM_TOTAL (but perhaps it was just a factor of a few larger than the normal; I don't remember that well). |
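The rule of thumb above can be written down as a quick check. This is a sketch under assumptions: the function name, the 0.5 cutoff, and the example byte counts are all made up for illustration, and the two totals are assumed to have been read by hand from the MEM_LIVE and MEM_TOTAL reports. It only encodes the idea that if deallocations are not being recorded, MEM_LIVE ends up close to MEM_TOTAL.

```python
def deallocations_look_recorded(mem_live, mem_total, max_live_fraction=0.5):
    """Heuristic: True if MEM_LIVE is well below MEM_TOTAL.

    max_live_fraction is an arbitrary illustrative cutoff, not an igprof setting.
    """
    if mem_total <= 0:
        raise ValueError("mem_total must be positive")
    return (mem_live / mem_total) < max_live_fraction


# Made-up byte counts, for illustration only.
print(deallocations_look_recorded(2_000_000_000, 50_000_000_000))   # True
print(deallocations_look_recorded(48_000_000_000, 50_000_000_000))  # False
```

A ratio alone is not conclusive: an inflated MEM_LIVE that is still well below MEM_TOTAL would pass this check, so the absolute MEM_LIVE value is worth comparing against a known-good profile as well.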
Using cmsRun on el8 and looking at MEM_LIVE and MEM_TOTAL on the second-to-last event shows that they are not close in value
|
Thanks for checking; apparently it's the other case from my attempt to recall: 20 GB in IgProf.39.MEM_LIVE.txt looks wrong. |
Yes this is for the Run3 job. |
I am running a CPU profiling job with the same step3 and 400 events on el8 to see if it segfaults, although the recent patch makes it occur less often. |
Jemalloc, which I think is linked into cmsRun, has a profiling option |
CPU profiling crashed around event 369 running on el8.
|
I found a fix for the memory profiling problem on el8. I will revert this pull request once the igprof fix goes in. |
Still seeing the occasional segfault in cpu profiling
|
Updated libunwind to the head of master to further help with segfaults in the dwarf_* functions |
+reconstruction
|
This issue is fully signed and ready to be closed. |
In two recent releases, igprof pp crashes in the reco step in 11834.21:
In both cases, the current module is MkFitProducer. Is it a coincidence, or do we have a regression?
Note that igprof mp does not crash in these workflows, and the crash happens around event 230-260. Since Jenkins tries to run igprof several times in case of failure, it looks like it's reproducible.
@slava77 @gartung