Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

9wm: Ignore SIGCHLD instead of handling it #16

Closed
wants to merge 1 commit into from
Closed

9wm: Ignore SIGCHLD instead of handling it #16

wants to merge 1 commit into from

Conversation

circl-lastname
Copy link

This prevents defunct processes from being created.
waitpid within a SIGCHLD handler doesn't do anything, as by the time you get the signal, the process has already exited. 9wm doesn't need to do anything with the process afterwards, so it's safe to just ignore.

@nealey
Copy link
Member

nealey commented Dec 17, 2021

It's been a long time since I used anything but Linux, but I'm almost positive that SIG_IGN will not clean up zombies on other types of Unix.

waitpid() within the SIGCHLD handler lets the kernel know it can clear the process structure. I wish I knew of a reference I could give you other than the Stevens book, but unfortunately it's been decades since I even thought about this issue, so I don't know where else to point you.

Are you actually seeing zombie processes lying around with the current code?

@nealey
Copy link
Member

nealey commented Dec 17, 2021

Aha, I finally found a document that explains why 9wm has this code: http://www.unixguide.net/unix/faq/3.13.shtml

How do I get rid of zombie processes that persevere?

Unfortunately, it's impossible to generalize how the death of child processes should behave, because the exact mechanism varies over the various flavors of Unix.

First of all, by default, you have to do a wait() for child processes under ALL flavors of Unix. That is, there is no flavor of Unix that I know of that will automatically flush child processes that exit, even if you don't do anything to tell it to do so.

Second, under some SysV-derived systems, if you do signal(SIGCHLD, SIG_IGN) (well, actually, it may be SIGCLD instead of SIGCHLD, but most of the newer SysV systems have #define SIGCHLD SIGCLD in the header files), then child processes will be cleaned up automatically, with no further effort in your part. The best way to find out if it works at your site is to try it, although if you are trying to write portable code, it's a bad idea to rely on this in any case.

Unfortunately, POSIX doesn't allow you to do this; the behavior of setting the SIGCHLD to SIG_IGN under POSIX is undefined, so you can't do it if your program is supposed to be POSIX-compliant.

I'm still curious what motivated you to write this patch, are you seeing problems with zombies not getting cleaned by 9wm?

@circl-lastname
Copy link
Author

circl-lastname commented Dec 17, 2021

Yes, and I run Linux.
In order to reproduce, you need to kill an unfocused program with Delete (although some can remain focused, don't get why)

From what I looked at, waitpid stops the thread until a process changes state (including exit), and SIGCHLD is sent when a child exits. Doing waitpid after SIGCHLD should do nothing, and since it has WNOHANG it returns immediately. There may be something I'm not understanding, but this solves my problem.

I can try an unpatched and patched version on some sort of BSD to see if this makes any difference.

@nealey
Copy link
Member

nealey commented Dec 17, 2021

Here's some text from man 2 wait:

A child that terminates, but has not been waited for becomes a "zombie". The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. If a parent process terminates, then its "zombie" children (if any) are adopted by init(1), (or by the nearest "subreaper" process as defined through the use of the prctl(2) PR_SET_CHILD_SUBREAPER operation); init(1) automatically performs a wait to remove the zombies.

It appears that the POSIX standard has changed, so I would expect a modern BSD to also respond to your patch the way you expect. A better test would be something old like BSD4.3 or, I don't know, Irix or HP-UX, where I'm 99.92% sure you'll see a bunch of zombies if you use your patch.

Probably more productive would be to look into why Linux is no longer reaping defunct processes handled with waitpid(), though. As a first start, something like the following might be interesting:

#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>

#define SECOND 1
#define MINUTE (60 * SECOND)

void sigchld(int signum) {
    pid_t kid;
    while (0 < (kid = waitpid(-1, NULL, WNOHANG))) {
        printf("Child %d has exited\n", kid);
    }
}

int main(int argc, char *argv[]) {
    signal(SIGCHLD, sigchld);
    for (int i = 0; i < 20; i += 1) {
        if (!fork()) {
            execlp("sleep", "sleep", "2s", NULL);
        }
    }

    for (int remain = 15 * SECOND; remain; remain = sleep(remain)) {
        printf("sleep(%d)\n", remain);
    }

    return 0;
}

On my system, when I run this and watch the process table, the sleep processes get cleaned up right away, there are no defunct processes. What does it do on yours?

@circl-lastname
Copy link
Author

circl-lastname commented Dec 18, 2021

Ah, I guess I misunderstood the man page. I have tried this code on my machine, and the zombie processes get cleared up.
I should mention that now see what the bug actually is. It's not about an unfocused window, or maybe not even about the handler. The first process gets properly cleared, and only the next ones after that leave zombies.
Does the SIGCHLD handler get cleared every time it is used?

@circl-lastname
Copy link
Author

Hello
I looked into the source code but I'm not sure whats causing this. Any ideas? Thanks.

@nealey
Copy link
Member

nealey commented Dec 27, 2021

Are you sure 9wm is the parent process of these zombies?

@circl-lastname
Copy link
Author

Yes, I am quite sure

@nealey
Copy link
Member

nealey commented Dec 29, 2021

I'm working on a larger test program to check something. While I'm doing this, could you try the following:

  1. Trigger the situation with all the zombies
  2. Kill 9wm with the CHLD signal (eg killall -CHLD 9wm)
  3. Check if there are still zombies

If this clears out your zombies, then there might be some race condition where the child (to 9wm) exits, sends SIGCHLD, but the grandchildren are somehow getting reparented to 9wm and exiting in a way that SIGCHLD is not sent (perhaps the dead parent set SA_NOCLDWAIT), and the while/waitpid loop has already exited. The manual SIGCHLD triggers that loop again, which finds the zombies and cleans them up.

@nealey
Copy link
Member

nealey commented Dec 29, 2021

Actually, setting SA_NOCLDWAIT would prevent the creation of zombies. I'm not seeing a way to "trick" the grandparent process, but I'm also not yet finding a way to prevent orphaned children from being reparented to the process group leader, so I'm not sure how 9wm is getting them. Still researching.

It is, however, looking increasingly likely that we're seeing a race condition here.

@circl-lastname
Copy link
Author

killall -CHLD 9wm

This doesn't do anything for me.

@circl-lastname
Copy link
Author

but the grandchildren are somehow getting reparented to 9wm

I'm not sure that is the case, as for instance xterm's child (bash) gets killed, while xterm itself leaves a zombie.

@nealey
Copy link
Member

nealey commented Jan 3, 2022

but the grandchildren are somehow getting reparented to 9wm

I'm not sure that is the case, as for instance xterm's child (bash) gets killed, while xterm itself leaves a zombie.

For the record, I am unable to reproduce this situation with zombies. I believe that you're seeing it, I just can't figure out how to make it happen on my machine.

It sounds like the situation is thus:

  • You are getting zombies. You are 100% certain the zombies are parented by 9wm.
  • Invoking the SIGCHLD handler, which runs waitpid, by sending a CHLD signal to 9wm, does not clean them up
  • Therefore it appears on your system waitpid does not work the way it says it does in the man page
  • Setting the SIGCHLD handler to SIG_IGN fixes this problem
  • The two of us have run out of ideas about why waitpid isn't doing what it's supposed to do

At this point I'm at a complete loss as to what to do next. The options seem to be:

  1. Set the SIGCHLD handler to SIG_IGN, breaking things that aren't Linux or BSD
  2. Don't change anything, breaking (at least) your Linux system
  3. Hope somebody else chimes in with a bright idea about what in the world is going on

I'm loathe to implement the first option, since it's an uninformed kludge that I would have added only because I'm too dumb to understand what the actual problem is. Also I don't even know what to test for with the #ifdef around the new SIG_IGN handler: do we check to see if we're in Linux? BSD? A system implementing process groups?

The second option is also pretty bad, because 9wm becomes a real pain in the rear for people with systems similar to yours, when previously it worked just fine. They will hopefully find this pull request, merge in the code, and it we're really lucky, leave a comment that they did so, with some information pointing us at what might be causing the difference.

The third option feels like giving up, but I feel like as the maintainer of code this old, of mostly historical interest, and with an active global userbase of maybe a dozen people, it's the most responsible thing to do.

So unless @Tookmund chimes in with a different opinion (or, better, an idea about what in the world is happening), I'm going with option three.

I'm going to leave the merge request open because the work @circl-lastname did in this pull request is the correct first option path out for anyone else experiencing this problem.

If you, the reader, have any insight about why this is happening, please leave a comment!

@nealey
Copy link
Member

nealey commented Jan 3, 2022

I just reproduced this! Yay!

@nealey
Copy link
Member

nealey commented Jan 3, 2022

Running under strace, I see the SIGCHLD handler being invoked only for the first exit, and never again.

@nealey
Copy link
Member

nealey commented Jan 3, 2022

@circl-lastname Could you try this patch against master and tell me if it solves the problem for you too?

diff --git a/9wm.c b/9wm.c
index 940de04..08e0b00 100644
--- a/9wm.c
+++ b/9wm.c
@@ -176,7 +176,12 @@ main(int argc, char *argv[])
                signal(SIGINT, SIG_IGN);
        if (signal(SIGHUP, sighandler) == SIG_IGN)
                signal(SIGHUP, SIG_IGN);
-       signal(SIGCHLD, sigchld);
+       {
+               struct sigaction act = {0};
+
+               act.sa_handler = sigchld;
+               sigaction(SIGCHLD, &act, NULL);
+       }
 
        exit_9wm = XInternAtom(dpy, "9WM_EXIT", False);
        restart_9wm = XInternAtom(dpy, "9WM_RESTART", False);

@circl-lastname
Copy link
Author

I will be trying your patch today

@circl-lastname
Copy link
Author

Well, sorry, I was busy, but your patch does work.

@nealey
Copy link
Member

nealey commented Jan 19, 2022

No worries! I'll push the fix and close this.

@nealey nealey closed this in 5ec37f3 Jan 19, 2022
@nealey
Copy link
Member

nealey commented Jan 19, 2022

Thanks for your patience in helping debug this!

nealey pushed a commit that referenced this pull request Jan 19, 2022
@circl-lastname
Copy link
Author

No problem! Also, while this isn't a huge issue, my e-mail is circl.lastname@gmail.com, rather than circl-lastname@gmail.com, but I'm nitpicking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants