9wm: Ignore SIGCHLD instead of handling it #16

circl-lastname · 2021-12-17T19:15:03Z

This prevents defunct processes from being created.
waitpid within a SIGCHLD handler doesn't do anything, as by the time you get the signal, the process has already exited. 9wm doesn't need to do anything with the process afterwards, so it's safe to just ignore.

nealey · 2021-12-17T21:08:55Z

It's been a long time since I used anything but Linux, but I'm almost positive that SIG_IGN will not clean up zombies on other types of Unix.

waitpid() within the SIGCHLD handler lets the kernel know it can clear the process structure. I wish I knew of a reference I could give you other than the Stevens book, but unfortunately it's been decades since I even thought about this issue, so I don't know where else to point you.

Are you actually seeing zombie processes lying around with the current code?

nealey · 2021-12-17T21:22:41Z

Aha, I finally found a document that explains why 9wm has this code: http://www.unixguide.net/unix/faq/3.13.shtml

How do I get rid of zombie processes that persevere?

Unfortunately, it's impossible to generalize how the death of child processes should behave, because the exact mechanism varies over the various flavors of Unix.

First of all, by default, you have to do a wait() for child processes under ALL flavors of Unix. That is, there is no flavor of Unix that I know of that will automatically flush child processes that exit, even if you don't do anything to tell it to do so.

Second, under some SysV-derived systems, if you do signal(SIGCHLD, SIG_IGN) (well, actually, it may be SIGCLD instead of SIGCHLD, but most of the newer SysV systems have #define SIGCHLD SIGCLD in the header files), then child processes will be cleaned up automatically, with no further effort in your part. The best way to find out if it works at your site is to try it, although if you are trying to write portable code, it's a bad idea to rely on this in any case.

Unfortunately, POSIX doesn't allow you to do this; the behavior of setting the SIGCHLD to SIG_IGN under POSIX is undefined, so you can't do it if your program is supposed to be POSIX-compliant.

I'm still curious what motivated you to write this patch, are you seeing problems with zombies not getting cleaned by 9wm?

circl-lastname · 2021-12-17T21:27:02Z

Yes, and I run Linux.
In order to reproduce, you need to kill an unfocused program with Delete (although some can remain focused, don't get why)

From what I looked at, waitpid stops the thread until a process changes state (including exit), and SIGCHLD is sent when a child exits. Doing waitpid after SIGCHLD should do nothing, and since it has WNOHANG it returns immediately. There may be something I'm not understanding, but this solves my problem.

I can try an unpatched and patched version on some sort of BSD to see if this makes any difference.

nealey · 2021-12-17T22:26:47Z

Here's some text from man 2 wait:

A child that terminates, but has not been waited for becomes a "zombie". The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. If a parent process terminates, then its "zombie" children (if any) are adopted by init(1), (or by the nearest "subreaper" process as defined through the use of the prctl(2) PR_SET_CHILD_SUBREAPER operation); init(1) automatically performs a wait to remove the zombies.

It appears that the POSIX standard has changed, so I would expect a modern BSD to also respond to your patch the way you expect. A better test would be something old like BSD4.3 or, I don't know, Irix or HP-UX, where I'm 99.92% sure you'll see a bunch of zombies if you use your patch.

Probably more productive would be to look into why Linux is no longer reaping defunct processes handled with waitpid(), though. As a first start, something like the following might be interesting:

#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>

#define SECOND 1
#define MINUTE (60 * SECOND)

void sigchld(int signum) {
    pid_t kid;
    while (0 < (kid = waitpid(-1, NULL, WNOHANG))) {
        printf("Child %d has exited\n", kid);
    }
}

int main(int argc, char *argv[]) {
    signal(SIGCHLD, sigchld);
    for (int i = 0; i < 20; i += 1) {
        if (!fork()) {
            execlp("sleep", "sleep", "2s", NULL);
        }
    }

    for (int remain = 15 * SECOND; remain; remain = sleep(remain)) {
        printf("sleep(%d)\n", remain);
    }

    return 0;
}

On my system, when I run this and watch the process table, the sleep processes get cleaned up right away, there are no defunct processes. What does it do on yours?

circl-lastname · 2021-12-18T12:20:47Z

Ah, I guess I misunderstood the man page. I have tried this code on my machine, and the zombie processes get cleared up.
I should mention that now see what the bug actually is. It's not about an unfocused window, or maybe not even about the handler. The first process gets properly cleared, and only the next ones after that leave zombies.
Does the SIGCHLD handler get cleared every time it is used?

circl-lastname · 2021-12-26T10:49:27Z

Hello
I looked into the source code but I'm not sure whats causing this. Any ideas? Thanks.

nealey · 2021-12-27T16:01:03Z

Are you sure 9wm is the parent process of these zombies?

circl-lastname · 2021-12-28T22:24:35Z

Yes, I am quite sure

nealey · 2021-12-29T17:12:30Z

I'm working on a larger test program to check something. While I'm doing this, could you try the following:

Trigger the situation with all the zombies
Kill 9wm with the CHLD signal (eg killall -CHLD 9wm)
Check if there are still zombies

If this clears out your zombies, then there might be some race condition where the child (to 9wm) exits, sends SIGCHLD, but the grandchildren are somehow getting reparented to 9wm and exiting in a way that SIGCHLD is not sent (perhaps the dead parent set SA_NOCLDWAIT), and the while/waitpid loop has already exited. The manual SIGCHLD triggers that loop again, which finds the zombies and cleans them up.

nealey · 2021-12-29T17:25:06Z

Actually, setting SA_NOCLDWAIT would prevent the creation of zombies. I'm not seeing a way to "trick" the grandparent process, but I'm also not yet finding a way to prevent orphaned children from being reparented to the process group leader, so I'm not sure how 9wm is getting them. Still researching.

It is, however, looking increasingly likely that we're seeing a race condition here.

circl-lastname · 2021-12-31T10:37:22Z

killall -CHLD 9wm

This doesn't do anything for me.

circl-lastname · 2021-12-31T10:43:04Z

but the grandchildren are somehow getting reparented to 9wm

I'm not sure that is the case, as for instance xterm's child (bash) gets killed, while xterm itself leaves a zombie.

nealey · 2022-01-03T19:58:56Z

but the grandchildren are somehow getting reparented to 9wm

I'm not sure that is the case, as for instance xterm's child (bash) gets killed, while xterm itself leaves a zombie.

For the record, I am unable to reproduce this situation with zombies. I believe that you're seeing it, I just can't figure out how to make it happen on my machine.

It sounds like the situation is thus:

You are getting zombies. You are 100% certain the zombies are parented by 9wm.
Invoking the SIGCHLD handler, which runs waitpid, by sending a CHLD signal to 9wm, does not clean them up
Therefore it appears on your system waitpid does not work the way it says it does in the man page
Setting the SIGCHLD handler to SIG_IGN fixes this problem
The two of us have run out of ideas about why waitpid isn't doing what it's supposed to do

At this point I'm at a complete loss as to what to do next. The options seem to be:

Set the SIGCHLD handler to SIG_IGN, breaking things that aren't Linux or BSD
Don't change anything, breaking (at least) your Linux system
Hope somebody else chimes in with a bright idea about what in the world is going on

I'm loathe to implement the first option, since it's an uninformed kludge that I would have added only because I'm too dumb to understand what the actual problem is. Also I don't even know what to test for with the #ifdef around the new SIG_IGN handler: do we check to see if we're in Linux? BSD? A system implementing process groups?

The second option is also pretty bad, because 9wm becomes a real pain in the rear for people with systems similar to yours, when previously it worked just fine. They will hopefully find this pull request, merge in the code, and it we're really lucky, leave a comment that they did so, with some information pointing us at what might be causing the difference.

The third option feels like giving up, but I feel like as the maintainer of code this old, of mostly historical interest, and with an active global userbase of maybe a dozen people, it's the most responsible thing to do.

So unless @Tookmund chimes in with a different opinion (or, better, an idea about what in the world is happening), I'm going with option three.

I'm going to leave the merge request open because the work @circl-lastname did in this pull request is the correct first option path out for anyone else experiencing this problem.

If you, the reader, have any insight about why this is happening, please leave a comment!

nealey · 2022-01-03T20:40:18Z

I just reproduced this! Yay!

nealey · 2022-01-03T20:46:09Z

Running under strace, I see the SIGCHLD handler being invoked only for the first exit, and never again.

nealey · 2022-01-03T21:00:53Z

@circl-lastname Could you try this patch against master and tell me if it solves the problem for you too?

diff --git a/9wm.c b/9wm.c
index 940de04..08e0b00 100644
--- a/9wm.c
+++ b/9wm.c
@@ -176,7 +176,12 @@ main(int argc, char *argv[])
                signal(SIGINT, SIG_IGN);
        if (signal(SIGHUP, sighandler) == SIG_IGN)
                signal(SIGHUP, SIG_IGN);
-       signal(SIGCHLD, sigchld);
+       {
+               struct sigaction act = {0};
+
+               act.sa_handler = sigchld;
+               sigaction(SIGCHLD, &act, NULL);
+       }
 
        exit_9wm = XInternAtom(dpy, "9WM_EXIT", False);
        restart_9wm = XInternAtom(dpy, "9WM_RESTART", False);

circl-lastname · 2022-01-07T08:38:17Z

I will be trying your patch today

circl-lastname · 2022-01-19T13:56:35Z

Well, sorry, I was busy, but your patch does work.

nealey · 2022-01-19T17:48:56Z

No worries! I'll push the fix and close this.

nealey · 2022-01-19T17:52:18Z

Thanks for your patience in helping debug this!

circl-lastname · 2022-01-20T13:43:45Z

No problem! Also, while this isn't a huge issue, my e-mail is circl.lastname@gmail.com, rather than circl-lastname@gmail.com, but I'm nitpicking.

9wm: Ignore SIGCHLD instead of handling it

6ffdd3a

nealey mentioned this pull request Jan 19, 2022

Only the first child is reaped #17

Closed

nealey closed this in 5ec37f3 Jan 19, 2022

nealey pushed a commit that referenced this pull request Jan 19, 2022

Credit circl-lastname on #16

06265e1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

9wm: Ignore SIGCHLD instead of handling it #16

9wm: Ignore SIGCHLD instead of handling it #16

circl-lastname commented Dec 17, 2021

nealey commented Dec 17, 2021

nealey commented Dec 17, 2021

How do I get rid of zombie processes that persevere?

circl-lastname commented Dec 17, 2021 •

edited

nealey commented Dec 17, 2021 •

edited

circl-lastname commented Dec 18, 2021 •

edited

circl-lastname commented Dec 26, 2021

nealey commented Dec 27, 2021

circl-lastname commented Dec 28, 2021

nealey commented Dec 29, 2021

nealey commented Dec 29, 2021

circl-lastname commented Dec 31, 2021

circl-lastname commented Dec 31, 2021

nealey commented Jan 3, 2022 •

edited

nealey commented Jan 3, 2022

nealey commented Jan 3, 2022

nealey commented Jan 3, 2022

circl-lastname commented Jan 7, 2022

circl-lastname commented Jan 19, 2022

nealey commented Jan 19, 2022

nealey commented Jan 19, 2022

circl-lastname commented Jan 20, 2022

9wm: Ignore SIGCHLD instead of handling it #16

9wm: Ignore SIGCHLD instead of handling it #16

Conversation

circl-lastname commented Dec 17, 2021

nealey commented Dec 17, 2021

nealey commented Dec 17, 2021

How do I get rid of zombie processes that persevere?

circl-lastname commented Dec 17, 2021 • edited

nealey commented Dec 17, 2021 • edited

circl-lastname commented Dec 18, 2021 • edited

circl-lastname commented Dec 26, 2021

nealey commented Dec 27, 2021

circl-lastname commented Dec 28, 2021

nealey commented Dec 29, 2021

nealey commented Dec 29, 2021

circl-lastname commented Dec 31, 2021

circl-lastname commented Dec 31, 2021

nealey commented Jan 3, 2022 • edited

nealey commented Jan 3, 2022

nealey commented Jan 3, 2022

nealey commented Jan 3, 2022

circl-lastname commented Jan 7, 2022

circl-lastname commented Jan 19, 2022

nealey commented Jan 19, 2022

nealey commented Jan 19, 2022

circl-lastname commented Jan 20, 2022

circl-lastname commented Dec 17, 2021 •

edited

nealey commented Dec 17, 2021 •

edited

circl-lastname commented Dec 18, 2021 •

edited

nealey commented Jan 3, 2022 •

edited