-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rc: do not call free/malloc in signal handlers #589
Comments
Just a quick look, and there seems to be more issue than just the
|
A couple of those seem to be low hanging fruits. Some of them however I presume would require significant effort to make async-signal-safe. If there's interest in fixing these upstream, then I can look into some of them. As for the original reported issue, marking the item as "to be freed" and freeing them elsewhere seems like it would be the least intrusive change (assuming leaking the memory away isn't acceptable). Another solution could be to pre-allocate a buffer for backing the list and avoid malloc/free when doing list insertion/removal entirely (or at least reduce chances of it occurring inside the sig-handler). |
OpenRC needs to respect async-signal-safe settings regardless of difficulty |
malloc (called by xasprintf) is not async-signal-safe. beside, the string here is constant, so there's no need to malloc it all. eerrorx isn't async-signal-safe either (due to calling fprintf and exit) but consequence of them are _typically_ not as grave as calling malloc while it's internal state is inconsistent. Bug: OpenRC#589
After looking a bit more into this - and correct me if I'm wrong - but I don't think this is safe to do. The signal might occur during a list operation (e.g |
One saving grace is that the pid list is modified only during Anything that I'm overlooking here? Or any better ideas? |
Right -- I was only looking at the happy path, as that's what's likely to be causing my hang. Fix things that aren't working first, then finish improving it.
ugh... Good point, we aren't using atomic list operations... Will think on it a bit more (it's 6am...) but don't necessarily wait for me; cheers! |
signalfd actually looks more portable than I thought -- but even if we can't use it (I didn't see a list of OS we support to check?), the idea is actually usable: we just need to mark a signal as having been received in the signal handler and do the actual handling in the main thread (e.g. in |
malloc (called by xasprintf) is not async-signal-safe. beside, the string here is constant, so there's no need to malloc it all. eerrorx isn't async-signal-safe either (due to calling fprintf and exit) but consequence of them are _typically_ not as grave as calling malloc while it's internal state is inconsistent. Bug: OpenRC#589
malloc (called by xasprintf) is not async-signal-safe. beside, the string here is constant, so there's no need to malloc it all. eerrorx isn't async-signal-safe either (due to calling fprintf and exit) but consequence of them are _typically_ not as grave as calling malloc while it's internal state is inconsistent. Bug: OpenRC#589
malloc (called by xasprintf) is not async-signal-safe. beside, the string here is constant, so there's no need to malloc it all. eerrorx isn't async-signal-safe either (due to calling fprintf and exit) but consequence of them are _typically_ not as grave as calling malloc while it's internal state is inconsistent. Bug: #589
Basically the OS's we support are linux and *bsd, a quick google seems to say that signalfd is portable, so it may be interesting to consider using. |
I was just informed that signalfd doesn't exist on *bsd, so we probably shouldn't use it. :( |
ah, sorry I was lead to believe signalfd was available because wayland uses them inconditionally (and wayland works on BSDs), but it comes from https://github.com/jiixyj/epoll-shim on these platforms. Either way, as I said we don't actually need signalfd to use the same principle -- just toggle a flag in the signal handler, and if polling is really required a plain pipe() will do |
I agree, merely toggling a flag is certainly the best handler imagineable. Checking after i.e. waitpid() and postponing maintenance till then should usually do the trick. Just added in my 2 cents, since I opened https://bugs.gentoo.org/434532 a 'little while' back. |
`free` is not async-signal-safe and calling it inside a signal handler can have bad effects, as reported in the musl ML: https://www.openwall.com/lists/musl/2023/01/23/1 the solution: - keep track of weather remove_pid() is being called from inside a signal handler or not. - if it's inside a signal handler then DO NOT call free - instead put that pointer into a "to be freed later" list. - if it's not inside a signal handler then take the "to be freed later" list and free anything in it. Bug: OpenRC#589 Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
the pid list will be accessed inside the signal handler. so we must ensure that a list operation doesn't get interrupted by a signal leaving the list at an inconsistent state (making it unsafe to interact with). Bug: OpenRC#589 (comment)
Finally got around setting up a chroot and testing the patch I had laying around. From some light testing and stepping through the code, it seems to working as intended. But weather it will fix @martinetd's hangs or not is a different story. See #594. |
`free` is not async-signal-safe and calling it inside a signal handler can have bad effects, as reported in the musl ML: https://www.openwall.com/lists/musl/2023/01/23/1 the solution: - keep track of weather remove_pid() is being called from inside a signal handler or not. - if it's inside a signal handler then DO NOT call free - instead put that pointer into a "to be freed later" list. - if it's not inside a signal handler then take the "to be freed later" list and free anything in it. Bug: OpenRC#589 Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
the pid list will be accessed inside the SIGCHLD signal handler. so we must ensure SIGCHLD handler doesn't get invoked while the list is at an inconsistent state making it unsafe to interact with. Bug: OpenRC#589 (comment)
the pid list will be accessed inside the SIGCHLD signal handler. so we must ensure SIGCHLD handler doesn't get invoked while the list is at an inconsistent state making it unsafe to interact with. Co-authored-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> Bug: OpenRC#589 (comment)
same rational as 459783b Bug: OpenRC#589
same rational as 459783b Bug: OpenRC#589
i had this issue too on an Amlogic S905d tv box with alpine linux. the test i did from another machine: for i in {1..20} ; do
echo -n "the $i rebooting test : "
ssh root@s905d 'echo ssh_success ; reboot'
sleep 35
done
one of the reboot would fail, and it is not able to ssh into it again, had to do a hard reset. EDIT: it happens fast especially when a HDMI cable is connected between the tv box and a monitor. |
That's pretty much my reproducer, yes. I've confirmed the PR fixes the issue on my end, but if you have time would you be able to test #594 ? |
@martinetd i did a 50 times reboot loop test again with HDMI connected, unfortunately i can't reproduce it again on
back in Dec, 2022 or Jan this year, it hangs within 20 reboot loop test. |
`free` is not async-signal-safe and calling it inside a signal handler can have bad effects, as reported in the musl ML: https://www.openwall.com/lists/musl/2023/01/23/1 the solution: - keep track of weather remove_pid() is being called from inside a signal handler or not. - if it's inside a signal handler then DO NOT call free - instead put that pointer into a "to be freed later" list. - if it's not inside a signal handler then take the "to be freed later" list and free anything in it. Bug: #589 Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
the pid list will be accessed inside the SIGCHLD signal handler. so we must ensure SIGCHLD handler doesn't get invoked while the list is at an inconsistent state making it unsafe to interact with. Co-authored-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> Bug: #589 (comment)
|
that depends on the entire runtime environment, not just the OS kernel. the C library used is probably the biggest factor. some ioctl's may be handled by the C library first and not only passed thru to the kernel. POSIX does not include ioctl as async-signal-safe. glibc lists it as AS-Safe. |
In that case |
we're not doing that |
I've spent part of this week chasing after a dead lock in "openrc shutdown" (running shutdown or reboot in a loop hangs once every few days in a
malloc()
call):https://www.openwall.com/lists/musl/2023/01/23/1
It turns out the problem is almost certainly that we're doing a free in our SIGCHLD handler:
free/malloc are not signal-safe, in particular musl libc does not take any lock as long as the process does not create threads -- if there had been we would likely just have seen a deadlock instead -- so if we have a stack like the following (bottom to top):
the free can be called while the malloc state is inconsistent and further corrupt mallocng's internal state.
(Yes that probably wouldn't cause problems without rc_parallel, call me a red neck)
Anyway, there are a few more mallocs in the
SIGUSR1
'src_plugin_run
but as a first approximation I'd like to make thatremove_pid
delay its free. The most straight-forward solution I can think of would be to have the signal handler not call remove_pid, but instead move the list link to a "to free later" list that can be checked once in a while during the main thread.That still is quite a bit of overengineering, but I cannot think of anything else that would be safe short of just leaking the memory. Would that be acceptable?
I'll open a PR early next week unless someone beats me to it.
The text was updated successfully, but these errors were encountered: