
WaitForExitAsync hangs and zombie processes are created (Linux) #114847

Closed
@Kachelda

Description


Hello!
I ran into an unusual situation and have already spent a couple of days looking for similar issues or any way to explain what happens.
It is very difficult to reproduce; I have caught it only twice. Everything works properly, and then suddenly some commands get stuck and fail with a timeout in WaitForExitAsync. The remarkable sign is that all of these processes become zombies. If I restart my service, everything is fixed and works properly again.
Once commands start getting stuck, it happens every time, and the number of zombie processes keeps growing (I have seen more than 2000).
I am running Linux (Rocky 8.1, .NET runtime 8.0.14).

I run many commands, but these are the ones that get stuck:

systemctl is-enabled --quiet sshd.service
systemctl is-active --quiet sshd.service
bash -c "dmesg -T > /var/log/dmesg.log"

Recently I added this one as well, and last time it got stuck too:
dotnet /opt/SomeApp/Some.App.dll

I use this code to run commands inside my app (a service):

public ProcessResult Run(CancellationToken cancellationToken = default)
    => RunAsync(cancellationToken).ConfigureAwait(false).GetAwaiter().GetResult();

public async Task<ProcessResult> RunAsync(CancellationToken cancellationToken = default)
    => await ExecProcessAsync(cancellationToken).ConfigureAwait(false);

private async Task<ProcessResult> ExecProcessAsync(CancellationToken cancellationToken = default)
{
    var startInfo = new ProcessStartInfo(Filename)
    {
        WindowStyle = ProcessWindowStyle.Hidden,
        RedirectStandardOutput = true,
        RedirectStandardError  = true,
        RedirectStandardInput  = true,
        UseShellExecute = false,
        CreateNoWindow = false
    };
    startInfo.AddArguments(Arguments, EscapeArguments);
    startInfo.AddEnvironmentVariables(EnvironmentVariables, Logger);    
    var processCmd = $"{startInfo.FileName} {string.Join(" ", Arguments)}";
    using var process = new Process { StartInfo = startInfo, EnableRaisingEvents = false };
    Logger.Debug($"Start process: {processCmd}");

    if (!process.Start())
    {
        if (DetailedLogging)
            Logger.Debug("Failed to start process: {ProcessCmd}", processCmd);
        return ProcessResult.FailedToStart(processCmd);
    }

    if (DetailedLogging)
        Logger.Debug("Process {ProcessId} has been started", process.Id);

    if (StdIn != null)
    {
        if (DetailedLogging)
            Logger.Debug("Writing data '{StdInData}' to the input", StdIn);

        await process.StandardInput.WriteLineAsync(StdIn.AsMemory(), cancellationToken).ConfigureAwait(false);
        process.StandardInput.Close();
    }
    var tStandardOutput = process.StandardOutput.ReadToEndAsync().ConfigureAwait(false);
    var tStandardError = process.StandardError.ReadToEndAsync().ConfigureAwait(false);
    var timeoutCts = new CancellationTokenSource(Timeout == TimeSpan.Zero ? System.Threading.Timeout.InfiniteTimeSpan : Timeout);
    var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(timeoutCts.Token, cancellationToken);
    try
    {
        if (DetailedLogging)
            Logger.Debug($"Process {process.Id} WaitForExitAsync started...");

        await process.WaitForExitAsync(linkedCts.Token).ConfigureAwait(false);

        if (DetailedLogging)
            Logger.Debug($"Process {process.Id} WaitForExitAsync finished...");
    }
    catch (OperationCanceledException) when (timeoutCts.Token.IsCancellationRequested)
    {
        KillProcessSilently(process);
        return ProcessResult.ExceededByTimeout(processCmd, stdOutput: await tStandardOutput, stdError: await tStandardError);
    }
    catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
    {
        KillProcessSilently(process);
        return ProcessResult.CancellationRequested(processCmd, stdOutput: await tStandardOutput, stdError: await tStandardError);
    }
    finally
    {
        timeoutCts.Dispose();
        linkedCts.Dispose();
    }
    return ProcessResult.ExecuteResult(processCmd, process.ExitCode, stdOutput: await tStandardOutput, stdError: await tStandardError);
}

private void KillProcessSilently(Process process)
{
    try
    {
        if (DetailedLogging)
            Logger.Debug("[TBDL] Killing the process {ProcessId} silently...", process.Id);

        process.Kill(true);

        if (DetailedLogging)
            Logger.Debug("[TBDL] Process {ProcessId} has been killed", process.Id);
    }
    catch (Exception ex)
    {
        Logger.Error(ex, $"Unable to kill process {process.Id}");
    }
}

I do not think it is related to output-redirection issues (that is a very common problem, and I have read a lot about it).
Unfortunately I have lost my lab environment: it was accidentally restarted, so now it works perfectly again.
If I had trouble reading the output, I would be stuck on reading the output (await tStandardOutput); I have already simulated and tested that behavior.

At the same time it looked like other processes could still run (e.g. collecting statistics every minute with "top -bn 2 -d 0.01"),
so I do not believe it is thread pool starvation or anything like that.

I am not very familiar with how all of this works under the hood, i.e. how the SIGCHLD signal is dispatched in both cases (WaitForExit and WaitForExitAsync).
I have only read a little about it:

Synchronous WaitForExit():
- Directly uses the native waitpid() system call for the specific PID on the current thread
- Blocks the calling thread until the process completes
- Does not use the complex chain with a pipe and a dispatcher thread
- Interfaces directly with the kernel's process state tracking, though SIGCHLD is still generated by the kernel

Asynchronous WaitForExitAsync():
- Registers an Exited event handler
- Relies on the SIGCHLD → pipe → dispatcher → ThreadPool handling chain
- Does not block the calling thread, freeing it for other work
- Depends on the entire signal processing chain operating normally
- Uses a TaskCompletionSource to complete the returned Task when the process exit is detected

Is this right, or is it wrong information?
So in my case I think something breaks in the async chain (which would explain why all the processes become zombies).

To be honest, so far I have no idea what to do or why it happens.
Maybe SIGCHLD is intercepted by another handler, or something like that.

Can you help me, please?
Thanks a lot in advance!

UPD.
I was wrong: all processes run after the problem appeared were getting stuck and failing with a timeout, including the statistics collection run every minute ("top -bn 2 -d 0.01").
This is the last picture from top before the service was restarted:
Tasks: 2297 total, 1 running, 149 sleeping, 0 stopped, 2147 zombie
%Cpu(s): 10.5 us, 10.5 sy, 0.0 ni, 78.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 3589.2 total, 1046.3 free, 915.4 used, 1627.5 buff/cache
MiB Swap: 4096.0 total, 4091.6 free, 4.4 used. 1840.5 avail Mem

And this is how it was before all of this started:
Tasks: 147 total, 1 running, 146 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 3589.2 total, 1553.8 free, 689.2 used, 1346.3 buff/cache
MiB Swap: 4096.0 total, 4096.0 free, 0.0 used. 2272.8 avail Mem

top -bn 2 -d 0.01
This was the first command that failed, about 2 minutes after the service was started.

UPD2.
I did some new tests. I was able to reproduce my case only by overriding the default .NET SIGCHLD handler with a custom empty handler (using sigaction). Only after that did all processes start becoming zombies, until I restored the default handler.
Currently I think that something overrode the default handler in my environment.
