Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rc: do not call free/malloc in signal handlers #589

Open
martinetd opened this issue Jan 27, 2023 · 19 comments
Open

rc: do not call free/malloc in signal handlers #589

martinetd opened this issue Jan 27, 2023 · 19 comments

Comments

@martinetd
Copy link
Contributor

I've spent part of this week chasing after a dead lock in "openrc shutdown" (running shutdown or reboot in a loop hangs once every few days in a malloc() call):
https://www.openwall.com/lists/musl/2023/01/23/1

It turns out the problem is almost certainly that we're doing a free in our SIGCHLD handler:

static void handle_signal(int sig) {
...
                /* Remove that pid from our list */
                if (pid > 0)
                        remove_pid(pid);
...
static void remove_pid(pid_t pid) {
...
                    LIST_REMOVE(p, entries); 
                    free(p);

free/malloc are not signal-safe, in particular musl libc does not take any lock as long as the process does not create threads -- if there had been we would likely just have seen a deadlock instead -- so if we have a stack like the following (bottom to top):

openrc functions...
malloc()
handle_signal()
free()

the free can be called while the malloc state is inconsistent and further corrupt mallocng's internal state.
(Yes that probably wouldn't cause problems without rc_parallel, call me a red neck)

Anyway, there are a few more mallocs in the SIGUSR1's rc_plugin_run but as a first approximation I'd like to make that remove_pid delay its free. The most straight-forward solution I can think of would be to have the signal handler not call remove_pid, but instead move the list link to a "to free later" list that can be checked once in a while during the main thread.
That still is quite a bit of overengineering, but I cannot think of anything else that would be safe short of just leaking the memory. Would that be acceptable?

I'll open a PR early next week unless someone beats me to it.

@N-R-K
Copy link
Contributor

N-R-K commented Jan 27, 2023

Just a quick look, and there seems to be more issue than just the free call. A couple other non-signal-safe function that I see being called:

  • fprintf (via eerror)
  • malloc (via xasprintf)
  • ioctl
  • exit
  • fclose, setenv, malloc and a whole bunch more inside rc_plugin_run

@N-R-K
Copy link
Contributor

N-R-K commented Jan 27, 2023

A couple of those seem to be low hanging fruits. Some of them however I presume would require significant effort to make async-signal-safe.

If there's interest in fixing these upstream, then I can look into some of them.


As for the original reported issue, marking the item as "to be freed" and freeing them elsewhere seems like it would be the least intrusive change (assuming leaking the memory away isn't acceptable).

Another solution could be to pre-allocate a buffer for backing the list and avoid malloc/free when doing list insertion/removal entirely (or at least reduce chances of it occurring inside the sig-handler).

@vapier
Copy link
Member

vapier commented Jan 27, 2023

OpenRC needs to respect async-signal-safe settings regardless of difficulty

N-R-K added a commit to N-R-K/openrc that referenced this issue Jan 27, 2023
malloc (called by xasprintf) is not async-signal-safe. beside, the
string here is constant, so there's no need to malloc it all.

eerrorx isn't async-signal-safe either (due to calling fprintf and exit)
but consequence of them are _typically_ not as grave as calling malloc
while it's internal state is inconsistent.

Bug: OpenRC#589
@N-R-K
Copy link
Contributor

N-R-K commented Jan 27, 2023

move the list link to a "to free later" list

After looking a bit more into this - and correct me if I'm wrong - but I don't think this is safe to do.

The signal might occur during a list operation (e.g LIST_INSERT_* or LIST_REMOVE) which can leave the list at an inconsistent state making it unsafe to interact with.

@N-R-K
Copy link
Contributor

N-R-K commented Jan 27, 2023

The signal might occur during a list operation (e.g LIST_INSERT_* or LIST_REMOVE) which can leave the list at an inconsistent state making it unsafe to interact with.

One saving grace is that the pid list is modified only during add_pid and remove_pid (and cleanup but I think we can ignore that one). So if we block signal during the LIST_INSERT_HEAD and LIST_REMOVE then we might be able to get away with the "free later list" idea.

Anything that I'm overlooking here? Or any better ideas?

@martinetd
Copy link
Contributor Author

Just a quick look, and there seems to be more issue than just the free call. A couple other non-signal-safe function that I see being called [...]

Right -- I was only looking at the happy path, as that's what's likely to be causing my hang. Fix things that aren't working first, then finish improving it.
I agree ultimately we'll want to try harder™, but it looks like it might take a while; I like your approach with the first PR, thanks for working on it! (and thanks vapier for taking it seriously as well -- I've had similar problems in other projects where the attitude was quite different...)

The signal might occur during a list operation (e.g LIST_INSERT_* or LIST_REMOVE) which can leave the list at an inconsistent state making it unsafe to interact with.

ugh... Good point, we aren't using atomic list operations...
Blocking signals should work, they'll be queued up until we unblock it so while it has a hammer feel I think it'll be safe. (signalfd would be another neat way of allowing all kind of async-unsafe things in handlers if it were portable, but it's not quite standard and I'm just thinking out loud)
There are atomic-update list algorithms but I don't think there's anything that'd allow to walk through the list at any time, and locks would just deadlock, so I don't have any better suggestion right now.

Will think on it a bit more (it's 6am...) but don't necessarily wait for me; cheers!

@martinetd
Copy link
Contributor Author

signalfd would be another neat way of allowing all kind of async-unsafe things in handlers if it were portable, but it's not quite standard and I'm just thinking out loud

signalfd actually looks more portable than I thought -- but even if we can't use it (I didn't see a list of OS we support to check?), the idea is actually usable: we just need to mark a signal as having been received in the signal handler and do the actual handling in the main thread (e.g. in wait_for_services and perhaps a few other places?)
That'll likely be easier than trying to fix all the things, for example I have no idea how to make the ioctl calls used in SIGWINCH async-signal safe...

N-R-K added a commit to N-R-K/openrc that referenced this issue Jan 28, 2023
malloc (called by xasprintf) is not async-signal-safe. beside, the
string here is constant, so there's no need to malloc it all.

eerrorx isn't async-signal-safe either (due to calling fprintf and exit)
but consequence of them are _typically_ not as grave as calling malloc
while it's internal state is inconsistent.

Bug: OpenRC#589
N-R-K added a commit to N-R-K/openrc that referenced this issue Jan 28, 2023
malloc (called by xasprintf) is not async-signal-safe. beside, the
string here is constant, so there's no need to malloc it all.

eerrorx isn't async-signal-safe either (due to calling fprintf and exit)
but consequence of them are _typically_ not as grave as calling malloc
while it's internal state is inconsistent.

Bug: OpenRC#589
vapier pushed a commit that referenced this issue Jan 28, 2023
malloc (called by xasprintf) is not async-signal-safe. beside, the
string here is constant, so there's no need to malloc it all.

eerrorx isn't async-signal-safe either (due to calling fprintf and exit)
but consequence of them are _typically_ not as grave as calling malloc
while it's internal state is inconsistent.

Bug: #589
@williamh
Copy link
Contributor

Basically the OS's we support are linux and *bsd, a quick google seems to say that signalfd is portable, so it may be interesting to consider using.

@williamh
Copy link
Contributor

I was just informed that signalfd doesn't exist on *bsd, so we probably shouldn't use it. :(

@martinetd
Copy link
Contributor Author

ah, sorry I was lead to believe signalfd was available because wayland uses them inconditionally (and wayland works on BSDs), but it comes from https://github.com/jiixyj/epoll-shim on these platforms.

Either way, as I said we don't actually need signalfd to use the same principle -- just toggle a flag in the signal handler, and if polling is really required a plain pipe() will do

@DarKRaveR77
Copy link

I agree, merely toggling a flag is certainly the best handler imagineable. Checking after i.e. waitpid() and postponing maintenance till then should usually do the trick.

Just added in my 2 cents, since I opened https://bugs.gentoo.org/434532 a 'little while' back.

N-R-K added a commit to N-R-K/openrc that referenced this issue Jan 29, 2023
`free` is not async-signal-safe and calling it inside a signal handler
can have bad effects, as reported in the musl ML:
https://www.openwall.com/lists/musl/2023/01/23/1

the solution:

- keep track of weather remove_pid() is being called from inside a
  signal handler or not.
- if it's inside a signal handler then DO NOT call free - instead put
  that pointer into a "to be freed later" list.
- if it's not inside a signal handler then take the "to be freed later"
  list and free anything in it.

Bug: OpenRC#589
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
N-R-K added a commit to N-R-K/openrc that referenced this issue Jan 29, 2023
the pid list will be accessed inside the signal handler. so we must
ensure that a list operation doesn't get interrupted by a signal leaving
the list at an inconsistent state (making it unsafe to interact with).

Bug: OpenRC#589 (comment)
@N-R-K
Copy link
Contributor

N-R-K commented Jan 29, 2023

Finally got around setting up a chroot and testing the patch I had laying around. From some light testing and stepping through the code, it seems to working as intended. But weather it will fix @martinetd's hangs or not is a different story.

See #594.

N-R-K added a commit to N-R-K/openrc that referenced this issue Jan 30, 2023
`free` is not async-signal-safe and calling it inside a signal handler
can have bad effects, as reported in the musl ML:
https://www.openwall.com/lists/musl/2023/01/23/1

the solution:

- keep track of weather remove_pid() is being called from inside a
  signal handler or not.
- if it's inside a signal handler then DO NOT call free - instead put
  that pointer into a "to be freed later" list.
- if it's not inside a signal handler then take the "to be freed later"
  list and free anything in it.

Bug: OpenRC#589
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
N-R-K added a commit to N-R-K/openrc that referenced this issue Jan 30, 2023
the pid list will be accessed inside the SIGCHLD signal handler. so we
must ensure SIGCHLD handler doesn't get invoked while the list is at an
inconsistent state making it unsafe to interact with.

Bug: OpenRC#589 (comment)
N-R-K added a commit to N-R-K/openrc that referenced this issue Jan 30, 2023
the pid list will be accessed inside the SIGCHLD signal handler. so we
must ensure SIGCHLD handler doesn't get invoked while the list is at an
inconsistent state making it unsafe to interact with.

Co-authored-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Bug: OpenRC#589 (comment)
N-R-K added a commit to N-R-K/openrc that referenced this issue Jan 31, 2023
N-R-K added a commit to N-R-K/openrc that referenced this issue Jan 31, 2023
vapier pushed a commit that referenced this issue Jan 31, 2023
vapier pushed a commit that referenced this issue Jan 31, 2023
@cattyhouse
Copy link

cattyhouse commented Mar 11, 2023

i had this issue too on an Amlogic S905d tv box with alpine linux. the test i did from another machine:

for i in {1..20} ; do 
    echo -n "the $i rebooting test : "
    ssh root@s905d 'echo ssh_success ; reboot'
    sleep 35
done

one of the reboot would fail, and it is not able to ssh into it again, had to do a hard reset.

EDIT: it happens fast especially when a HDMI cable is connected between the tv box and a monitor.

@martinetd
Copy link
Contributor Author

That's pretty much my reproducer, yes.
(takes a bit more than 20 attempts though...)

I've confirmed the PR fixes the issue on my end, but if you have time would you be able to test #594 ?
I'm not sure that'll get it merged faster, but at least it'll confirm you'll be better off as soon as it is :)

@cattyhouse
Copy link

@martinetd i did a 50 times reboot loop test again with HDMI connected, unfortunately i can't reproduce it again on alpine 3.17.2 with openrc 0.45.2 and linux 6.2.2, and with kernel cmdline boot arg:

root=UUID=9b134577-fc55-4d0a-9761-d4e132bfa89b cpufreq.default_governor=schedutil quiet loglevel=3 console=ttyAML0,115200n8 console=tty0 no_console_suspend net.ifnames=0 rootfstype=ext4 modules=meson-gx-mmc,mmc_block,dwc3-meson-g12a,dwc2,dwc3,mdio-mux-mmioreg,dwmac-meson8b,dwmac-generic,fixed,ext4

back in Dec, 2022 or Jan this year, it hangs within 20 reboot loop test.

williamh pushed a commit that referenced this issue Apr 25, 2023
`free` is not async-signal-safe and calling it inside a signal handler
can have bad effects, as reported in the musl ML:
https://www.openwall.com/lists/musl/2023/01/23/1

the solution:

- keep track of weather remove_pid() is being called from inside a
  signal handler or not.
- if it's inside a signal handler then DO NOT call free - instead put
  that pointer into a "to be freed later" list.
- if it's not inside a signal handler then take the "to be freed later"
  list and free anything in it.

Bug: #589
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
williamh pushed a commit that referenced this issue Apr 25, 2023
the pid list will be accessed inside the SIGCHLD signal handler. so we
must ensure SIGCHLD handler doesn't get invoked while the list is at an
inconsistent state making it unsafe to interact with.

Co-authored-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Bug: #589 (comment)
@DemiMarie
Copy link

ioctl() is async signal safe on Linux and probably on other platforms too, since it is just a syscall.

@vapier
Copy link
Member

vapier commented Sep 8, 2023

that depends on the entire runtime environment, not just the OS kernel. the C library used is probably the biggest factor. some ioctl's may be handled by the C library first and not only passed thru to the kernel.

POSIX does not include ioctl as async-signal-safe.
https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04_03_03

glibc lists it as AS-Safe.
https://www.gnu.org/software/libc/manual/html_node/IOCTLs.html#index-ioctl

@DemiMarie
Copy link

In that case syscall(SYS_ioctl, ...) should be AS-safe.

@vapier
Copy link
Member

vapier commented Sep 8, 2023

we're not doing that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants