debug(ci): capture gdb backtrace on OverlapCheck segfault #29
Closed
olantwin wants to merge 4 commits into
All dependencies are now available from the prefix.dev/ship channel (via ship-conda-recipes), so CI no longer needs CVMFS mounts, the CERN container, or a self-hosted runner. The new workflow runs on ubuntu-latest with prefix-dev/setup-pixi and a single `pixi run test` step that configures, builds, and runs ctest.
Picks up the geomodel rebuild from ShipSoft/ship-conda-recipes#15, which restores ABI compatibility with the current conda-forge cxx-compiler (libstdcxx >= 15). Confirmed locally that the full test suite, including OverlapCheck, now passes without the std::format workaround from #27.
ShipSoft/ship-conda-recipes#17 added a linker version script hiding std::__format weak template instantiations from libGeoModelXml/Write/Read/DBManager exports. Local symbol audit now reports 0 weak _ZNSt8__format symbols in those libraries (was 4 each on build 2), and the full test suite, including OverlapCheck, passes in this environment.
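A symbol audit of this kind could be scripted roughly as follows. This is a minimal sketch, not the audit actually used here: it assumes `nm -D` prints the usual `address type name` columns and treats the types `W`/`w`/`V`/`v` as weak. The helper names (`weak_format_symbols`, `kind`, `audit`) are hypothetical. Grouping the hits by mangled prefix shows whether any vtable or typeinfo symbols are exported alongside the `_ZN…` functions.

```python
import subprocess
from collections import Counter

# nm type letters treated as weak: W/w (weak symbol), V/v (weak object).
WEAK_TYPES = {"W", "w", "V", "v"}


def weak_format_symbols(nm_output, needle="St8__format"):
    """Return (type, name) pairs for weak dynamic symbols containing `needle`."""
    hits = []
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        # The last two columns are type and name; undefined symbols have no address.
        sym_type = fields[-2]
        name = fields[-1].split("@")[0]  # drop any @VERSION suffix
        if sym_type in WEAK_TYPES and needle in name:
            hits.append((sym_type, name))
    return hits


def kind(name):
    """Classify a mangled name by its Itanium ABI prefix."""
    for prefix, label in (
        ("_ZTV", "vtable"),
        ("_ZTI", "typeinfo"),
        ("_ZTS", "typeinfo name"),
    ):
        if name.startswith(prefix):
            return label
    return "function/other"


def audit(library_path):
    """Run `nm -D` on a shared library and count weak std::__format exports by kind."""
    out = subprocess.run(
        ["nm", "-D", library_path],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return Counter(kind(name) for _, name in weak_format_symbols(out))
```

Calling `audit()` on each geomodel library (the path to the installed `.so` is left to the caller) returns a counter per symbol kind. An empty counter means no weak `std::__format` exports were found.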
PR #26's OverlapCheck test segfaults on GHA ubuntu-latest after "Copy 54 x=1300 y=3220" from libGeoModelXml's Gmx2Geo, but cannot be reproduced locally (NixOS pixi env or Ubuntu 24.04 container with matching glibc 2.39; both pass 25/25 and valgrind reports 0 errors). Three rounds of geomodel rebuilds in ship-conda-recipes (build_number bump, -fvisibility-inlines-hidden, linker version script) have not helped.

Add gdb to the pixi env and wrap the failing test step with a diagnostic step that runs build_geometry under gdb --batch on failure, printing the full backtrace into the action log.

Once we know the actual crash site, this commit gets reverted (or gated behind a workflow_dispatch input) before merge.

Diagnostic commit: DO NOT MERGE TO main.
olantwin added a commit to ShipSoft/ship-conda-recipes that referenced this pull request Jun 5, 2026
PR #17 added `-Wl,--version-script={local: _ZNSt8__format*;}` which hid `_ZN…`-mangled function symbols. Local audit confirmed 0 weak `_ZN` exports per geomodel lib. But CI debug PR (ShipSoft/Geometry#29) captured a backtrace showing the crash is still in libGeoModelWrite:

    Program received signal SIGSEGV
    #0 std::__format::_Sink<char>::_M_write(…) from libGeoModelWrite.so.6
    #1 std::__format::_Formatting_scanner<…>::_M_on_chars(…) from libGeoModelWrite.so.6
    #2 std::__format::_Scanner<char>::_M_scan(…) at <format>:3942
    …
    #7 SHiPGeometry::CalorimeterFactory::buildStack(…) at CalorimeterFactory.cpp:184

`nm -D libGeoModelWrite.so.6 | grep __format` still shows 15 entries (all `V` vague-linkage) for vtables (`_ZTV…`), typeinfo (`_ZTI…`), and typeinfo names (`_ZTS…`) of _Sink<char>, _Buf_sink, _Fixedbuf_sink, _Seq_sink<string>, _Scanner, _Formatting_scanner. Those are not matched by `_ZNSt8__format*` (they start with `_ZTV`/`_ZTI`/`_ZTS`, not `_ZN`), so build_geometry's `_Seq_sink<string>` constructor still binds its vptr to libGeoModelWrite's vtable, and the virtual call lands in libGeoModelWrite's compiled-with-different-headers `_M_write`.

Broaden the version script pattern to `*NSt8__format*` so vtables, typeinfo, typeinfo names, and any future nested std::__format symbols all get localized. Bump build.number to 4.
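The difference between the two version-script patterns can be checked with a quick glob test. This is only an illustrative sketch: it uses Python's `fnmatch.fnmatchcase` as a stand-in for the linker's glob matching, which is an assumption that holds for plain `*` wildcards. The example symbols were mangled by hand from the class names in the backtrace and were not copied from real `nm` output.

```python
from fnmatch import fnmatchcase

OLD_PATTERN = "_ZNSt8__format*"   # pattern from PR #17
NEW_PATTERN = "*NSt8__format*"    # broadened pattern

# Hand-mangled examples for std::__format::_Sink<char> (illustrative only).
EXAMPLE_SYMBOLS = {
    "function": "_ZNSt8__format5_SinkIcE8_M_writeESt17basic_string_viewIcSt11char_traitsIcEE",
    "vtable": "_ZTVNSt8__format5_SinkIcEE",
    "typeinfo": "_ZTINSt8__format5_SinkIcEE",
    "typeinfo name": "_ZTSNSt8__format5_SinkIcEE",
}


def localized_by(pattern):
    """Return the kinds of example symbols that `pattern` would match."""
    return {
        kind
        for kind, symbol in EXAMPLE_SYMBOLS.items()
        if fnmatchcase(symbol, pattern)
    }


if __name__ == "__main__":
    print("old pattern matches:", sorted(localized_by(OLD_PATTERN)))
    print("new pattern matches:", sorted(localized_by(NEW_PATTERN)))
```

With these examples the old pattern matches only the function symbol, while the broadened one also matches the vtable, typeinfo, and typeinfo-name symbols.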
Diagnostic only — not for merge.
PR #26's OverlapCheck test segfaults on GHA `ubuntu-latest` after `Copy 54 x=1300 y=3220` from libGeoModelXml's `Gmx2Geo`, but cannot be reproduced anywhere I can run gdb: … `ulimit -s 2048` … `ubuntu-latest` …

Three rounds of geomodel rebuilds via `ship-conda-recipes` (#15 bump, #16 `-fvisibility-inlines-hidden`, #17 `--version-script={local: _ZNSt8__format*;}`) have not helped: the rebuild that drops 4 → 0 weak `_ZNSt8__format` exports per geomodel lib still crashes at the identical line on CI.

This PR adds `gdb` to the pixi env and wraps the test step so that on failure, `build_geometry` is re-run under `gdb --batch --ex run --ex 'thread apply all bt full'`. The full backtrace lands in the action log; the first frame should tell us whether the crash lives in `Gmx2Geo`, libstdc++ format machinery, SHiP's `SHiPTimingDetInterface`, or somewhere else.

Once we have the backtrace, this PR is closed and the diagnostic is reverted.
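The "re-run under gdb on failure" step could look roughly like the sketch below. It is not the workflow step from this PR: the helper names (`gdb_backtrace_cmd`, `run_with_backtrace`) are hypothetical, and the path and arguments of `build_geometry` are left to the caller because they are not given here. The gdb options are the ones quoted in the description, plus `--args` to pass the program's own arguments through.

```python
import subprocess
import sys


def gdb_backtrace_cmd(binary, args=()):
    """Build the argv that runs `binary` under gdb in batch mode and dumps all stacks."""
    return [
        "gdb",
        "--batch",                            # exit once the commands have run
        "--ex", "run",                        # start the program
        "--ex", "thread apply all bt full",   # after it stops, print every thread's backtrace
        "--args", binary, *args,              # everything after --args belongs to the program
    ]


def run_with_backtrace(binary, args=()):
    """Run `binary`; if it fails, re-run it under gdb so the backtrace reaches the log."""
    result = subprocess.run([binary, *args])
    if result.returncode != 0:
        # Only the diagnostic re-run goes through gdb; the original exit status is kept.
        subprocess.run(gdb_backtrace_cmd(binary, args))
    return result.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_with_backtrace(sys.argv[1], sys.argv[2:]))
```

The script returns the original exit status, so the CI step still fails after the backtrace has been printed.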