Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix gc crash #4944

Merged
merged 3 commits into from
Jun 29, 2021
Merged

Fix gc crash #4944

merged 3 commits into from
Jun 29, 2021

Conversation

roberth
Copy link
Member

@roberth roberth commented Jun 24, 2021

I've figured out another cause of gc crashes, related to coroutines.

This now solves

  • a crash that occurs whenever a gc-registered thread is in a coroutine (the boehmgc patch)
  • an inappropriate "stack overflow" when gc runs while a coroutine exists (the BoehmGCStackAllocator commit)

I'll run nixUnstable with this PR and #4895 to see if any other crashes occur.

This fixes a crash that looks like:

```
Thread 1 "nix-build" received signal SIGSEGV, Segmentation fault.
0x00007ffff7ad22a0 in GC_push_all_eager () from /nix/store/p1z58l18klf88iijpd0qi8yd2n9lhlk4-boehm-gc-8.0.4/lib/libgc.so.1
(gdb) bt
0  0x00007ffff7ad22a0 in GC_push_all_eager () from /nix/store/p1z58l18klf88iijpd0qi8yd2n9lhlk4-boehm-gc-8.0.4/lib/libgc.so.1
1  0x00007ffff7adeefb in GC_push_all_stacks () from /nix/store/p1z58l18klf88iijpd0qi8yd2n9lhlk4-boehm-gc-8.0.4/lib/libgc.so.1
2  0x00007ffff7ad5ac7 in GC_mark_some () from /nix/store/p1z58l18klf88iijpd0qi8yd2n9lhlk4-boehm-gc-8.0.4/lib/libgc.so.1
3  0x00007ffff7ad77bd in GC_stopped_mark () from /nix/store/p1z58l18klf88iijpd0qi8yd2n9lhlk4-boehm-gc-8.0.4/lib/libgc.so.1
4  0x00007ffff7adbe3a in GC_try_to_collect_inner.part.0 () from /nix/store/p1z58l18klf88iijpd0qi8yd2n9lhlk4-boehm-gc-8.0.4/lib/libgc.so.1
5  0x00007ffff7adc2a2 in GC_collect_or_expand () from /nix/store/p1z58l18klf88iijpd0qi8yd2n9lhlk4-boehm-gc-8.0.4/lib/libgc.so.1
6  0x00007ffff7adc4f8 in GC_allocobj () from /nix/store/p1z58l18klf88iijpd0qi8yd2n9lhlk4-boehm-gc-8.0.4/lib/libgc.so.1
7  0x00007ffff7adc88f in GC_generic_malloc_inner () from /nix/store/p1z58l18klf88iijpd0qi8yd2n9lhlk4-boehm-gc-8.0.4/lib/libgc.so.1
8  0x00007ffff7ae1a04 in GC_generic_malloc_many () from /nix/store/p1z58l18klf88iijpd0qi8yd2n9lhlk4-boehm-gc-8.0.4/lib/libgc.so.1
9  0x00007ffff7ae1c72 in GC_malloc_kind () from /nix/store/p1z58l18klf88iijpd0qi8yd2n9lhlk4-boehm-gc-8.0.4/lib/libgc.so.1
10 0x00007ffff7e003d6 in nix::EvalState::allocValue() () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
11 0x00007ffff7e04b9c in nix::EvalState::callPrimOp(nix::Value&, nix::Value&, nix::Value&, nix::Pos const&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
12 0x00007ffff7e0a773 in nix::EvalState::callFunction(nix::Value&, nix::Value&, nix::Value&, nix::Pos const&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
13 0x00007ffff7e0a91d in nix::ExprApp::eval(nix::EvalState&, nix::Env&, nix::Value&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
14 0x00007ffff7e0a8f8 in nix::ExprApp::eval(nix::EvalState&, nix::Env&, nix::Value&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
15 0x00007ffff7e0e0e8 in nix::ExprOpNEq::eval(nix::EvalState&, nix::Env&, nix::Value&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
16 0x00007ffff7e0d708 in nix::ExprOpOr::eval(nix::EvalState&, nix::Env&, nix::Value&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
17 0x00007ffff7e0d695 in nix::ExprOpOr::eval(nix::EvalState&, nix::Env&, nix::Value&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
18 0x00007ffff7e0d695 in nix::ExprOpOr::eval(nix::EvalState&, nix::Env&, nix::Value&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
19 0x00007ffff7e0d695 in nix::ExprOpOr::eval(nix::EvalState&, nix::Env&, nix::Value&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
20 0x00007ffff7e0d695 in nix::ExprOpOr::eval(nix::EvalState&, nix::Env&, nix::Value&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
21 0x00007ffff7e09e19 in nix::ExprOpNot::eval(nix::EvalState&, nix::Env&, nix::Value&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
22 0x00007ffff7e0a792 in nix::EvalState::callFunction(nix::Value&, nix::Value&, nix::Value&, nix::Pos const&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
23 0x00007ffff7e8cba0 in nix::addPath(nix::EvalState&, nix::Pos const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, nix::Value*, nix::FileIngestionMethod, std::optional<nix::Hash>, nix::Value&)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#1}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixexpr.so
24 0x00007ffff752e6f9 in nix::dump(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, nix::Sink&, std::function<bool (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixutil.so
25 0x00007ffff752e8e2 in nix::dump(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, nix::Sink&, std::function<bool (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixutil.so
26 0x00007ffff752e8e2 in nix::dump(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, nix::Sink&, std::function<bool (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixutil.so
27 0x00007ffff752e8e2 in nix::dump(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, nix::Sink&, std::function<bool (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixutil.so
28 0x00007ffff752e8e2 in nix::dump(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, nix::Sink&, std::function<bool (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixutil.so
29 0x00007ffff752e8e2 in nix::dump(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, nix::Sink&, std::function<bool (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>&) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixutil.so
30 0x00007ffff757f8c0 in void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, nix::VirtualStackAllocator, boost::coroutines2::detail::pull_coroutine<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::control_block::control_block<nix::VirtualStackAllocator, nix::sinkToSource(std::function<void (nix::Sink&)>, std::function<void ()>)::SinkToSource::read(char*, unsigned long)::{lambda(boost::coroutines2::detail::push_coroutine<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&)#1}>(boost::context::preallocated, nix::VirtualStackAllocator&&, nix::sinkToSource(std::function<void (nix::Sink&)>, std::function<void ()>)::SinkToSource::read(char*, unsigned long)::{lambda(boost::coroutines2::detail::push_coroutine<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&)#1}&&)::{lambda(boost::context::fiber&&)#1}> >(boost::context::detail::transfer_t) () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libnixutil.so
31 0x00007ffff6f331ef in make_fcontext () from /nix/store/hzdzcv9d3bc8rlsaphh7x54zsf0x8nx6-nix-2.4pre20210601_5985b8b/lib/libboost_context.so.1.69.0
32 0x0000000000000000 in ?? ()
```
Fixes the problem where a stack pointer outside the original
thread causes the collector to crash.

It could be made more accurate by recording the stack pointer
every time we switch to a coroutine. We can use this information
to update our own coroutine stacks like normal data. When the
stack pointer is on a thread, we can add a field to GC_thread
"fallback_sp" to be used when the thread sp is outside the original
thread range.
@jonringer
Copy link
Contributor

should the boehm-gc patch be upstreamed? looks like it addresses some of the FIXME comments above

@roberth
Copy link
Member Author

roberth commented Jun 25, 2021

looks like it addresses some of the FIXME comments above

The existing fixmes in boehm are for an unrelated feature on very rare architectures where the stack grows in the other direction. I did not implement those fixmes because we don't need it. Otoh, I couldn't resist writing the code to support these rare archs in my own part. I'll remove it, so the next person won't think it matters to Nix.

@roberth
Copy link
Member Author

roberth commented Jun 25, 2021

should the boehm-gc patch be upstreamed?

As it is, it is part of a very specific solution that works for Nix, which isn't ideal because it scans garbage stack sections sometimes. That's ok because it's a conservative garbage collector and in the case of Nix those garbage sections will not be scanned anymore when the coroutine has ended. This is the main reason I have doubts about upstreaming.
A more accurate implementation involves more changes as outlined in the inline comments in the patch. These affect both boehmgc and boost_coroutine2.
So the PR is a local optimum. With significant effort, it can be made accurate and upstream-worthy, but it improves Nix only marginally, so it's not something I would enjoy doing in my free time.

@manveru
Copy link
Contributor

manveru commented Jun 26, 2021

Can confirm that this fixes a segfault we experience when building github:input-output-hk/daedalus/4.1.0#daedalus with GC enabled.

@domenkozar
Copy link
Member

should the boehm-gc patch be upstreamed?

As it is, it is part of a very specific solution that works for Nix, which isn't ideal because it scans garbage stack sections sometimes. That's ok because it's a conservative garbage collector and in the case of Nix those garbage sections will not be scanned anymore when the coroutine has ended. This is the main reason I have doubts about upstreaming.
A more accurate implementation involves more changes as outlined in the inline comments in the patch. These affect both boehmgc and boost_coroutine2.
So the PR is a local optimum. With significant effort, it can be made accurate and upstream-worthy, but it improves Nix only marginally, so it's not something I would enjoy doing in my free time.

That's fair. Could we open an issue and explain what we did, so they're aware of our use case?

@domenkozar domenkozar requested a review from edolstra June 29, 2021 09:58
@edolstra edolstra merged commit f14c3b6 into NixOS:master Jun 29, 2021
@edolstra
Copy link
Member

Really great work, thanks!

Yeah it would be good to get upstream's ideas about this.

@roberth
Copy link
Member Author

roberth commented Jul 1, 2021

While formulating my question I've found issues I hadn't seen before, with some recommendations. I'll summarize my findings here.
ivmai/bdwgc#150 is an interaction with someone who seems to have found a similar solution. Since then, functions have been added to manipulate the recorded stack base per thread. Maintaining these stack base pointers requires a gc lock around the coroutine context switch. While the current approach to scanning the coroutine stacks works, it'd be better to provide our own stack "pushing" routine as mentioned here, which runs as part of gc and does not need to worry about synchronization as much. It also may help with potential mark stack overflows, although I don't think we've experienced those.
So most of this will be helpful when writing a more accurate solution.
For now our solution is similar to ivmai/bdwgc#150 and we err on the side of overly conservative gc when it comes to stacks during coroutine execution.

jonringer pushed a commit to jonringer/nixpkgs that referenced this pull request Jul 19, 2021
As has been done in NixOS/nix#4944

This introduces the boehmgc_nix and boehmgc_nixUnstable attributes
which are useful for external packages that link with Nix and its
boehmgc.
github-actions bot pushed a commit to NixOS/nixpkgs that referenced this pull request Jul 21, 2021
As has been done in NixOS/nix#4944

This introduces the boehmgc_nix and boehmgc_nixUnstable attributes
which are useful for external packages that link with Nix and its
boehmgc.

(cherry picked from commit 596ac24)
@lilyball
Copy link
Member

The boehmgc patch only applies to Linux. Boehm GC has a separate "stop the world and push thread stacks" implementation for Darwin (see darwin_stop_world.c) and this still crashes when GC is invoked from a coroutine.

@roberth
Copy link
Member Author

roberth commented May 11, 2022

The boehmgc patch only applies to Linux.

This was an oversight. See #4919

crashes when GC is invoked from a coroutine.

It's more frequent than that. The main cause is that an OS thread has its stack pointer in a coroutine, which is not supported by an unpatched GC. Which thread (or stack) invokes the GC doesn't really matter beyond that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants