fix: eliminate orphaned process leaks causing OOM kills #222

Open
dangtrivan15 wants to merge 5 commits into AutoForgeAI:master from dangtrivan15:fix/process-orphan-leaks


@dangtrivan15

Summary

  • Process group kill: kill_process_tree() now uses os.killpg() when the subprocess was started with start_new_session=True, killing the entire process group atomically. This eliminates the TOCTOU race where children get reparented to PID 1 before psutil.children() runs.
  • start_new_session=True on all 6 spawn sites: Every subprocess.Popen call (orchestrator, agent manager, dev server manager) now creates a new process group on Unix.
  • cli.js process group kill: killProcess() now sends SIGTERM to -pid (the process group) instead of a single PID. Uvicorn is spawned with detached: true.
  • Periodic orphan reaper: Background task (Linux only) scans every 60s for chrome/node/esbuild processes orphaned under PID 1 and kills them after a 30s grace period.
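A minimal sketch of the process group kill described in the first bullet, assuming the child was started with start_new_session=True. The helper name kill_process_group and the grace_seconds parameter are hypothetical and are not taken from this PR; the real kill_process_tree() also falls back to a psutil tree walk, which is omitted here.

```python
import os
import signal
import subprocess


def kill_process_group(proc: subprocess.Popen, grace_seconds: float = 5.0) -> bool:
    """Signal every process in proc's group; return False if no group kill applies.

    Assumes proc was started with start_new_session=True, so it leads its own group.
    """
    if os.name != "posix":
        return False  # process groups are a Unix concept; the caller must fall back
    try:
        pgid = os.getpgid(proc.pid)
    except ProcessLookupError:
        return False  # already exited and reaped
    if pgid != proc.pid or pgid == os.getpgrp():
        return False  # not a group leader, or it shares our group: never signal ourselves

    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # the group emptied between the lookup and the signal
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        pass  # the leader ignored SIGTERM; the SIGKILL below handles it
    try:
        os.killpg(pgid, signal.SIGKILL)  # stragglers that outlived the leader
    except ProcessLookupError:
        pass  # every member has already exited
    proc.wait()  # reap the leader so it does not linger as a zombie
    return True
```

The signal goes to the group ID rather than to a list of PIDs collected beforehand, so it does not matter whether a descendant has already been reparented: group membership survives reparenting. A False return is the cue for the caller to use its existing per-process fallback.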

Problem

After each agent session, dev server processes (node, npm, vite, esbuild) and Playwright browser instances (chrome, chrome_crashpad) survive as orphans. Over hours, these accumulate to hundreds of processes consuming gigabytes of memory, eventually hitting the container memory limit and triggering OOM kills.

Root cause: kill_process_tree() uses psutil.Process(pid).children(recursive=True) to find children, but if the parent dies before the tree walk runs, its children have already been reparented to PID 1 and no longer appear in the tree.
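The reparenting described above can be observed directly on Linux. The sketch below is illustrative only (the helper names read_ppid and observe_orphan are hypothetical, and it reads /proc, so it is Linux-specific): a shell starts a background sleep and exits, after which the sleep has a new parent but still carries the shell's process group ID.

```python
import os
import signal
import subprocess


def read_ppid(pid: int) -> int:
    """Return the parent PID of pid from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name sits in parentheses and may contain spaces, so split
    # after the last ')': the remaining fields start with state, then ppid.
    return int(stat.rsplit(")", 1)[1].split()[1])


def observe_orphan() -> dict:
    """Spawn a shell that leaves a background sleep behind, then inspect the sleep."""
    shell = subprocess.Popen(
        ["sh", "-c", "sleep 30 >/dev/null 2>&1 & echo $!"],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,  # the shell leads a new group; the sleep joins it
    )
    out, _ = shell.communicate()  # the shell prints the sleep's PID and exits
    grandchild = int(out)
    try:
        return {
            "shell_pid": shell.pid,
            "ppid": read_ppid(grandchild),   # no longer the shell
            "pgid": os.getpgid(grandchild),  # still the shell's group
        }
    finally:
        os.killpg(shell.pid, signal.SIGKILL)  # clean up the leftover sleep


if __name__ == "__main__":
    print(observe_orphan())
```

A walk over the dead shell's children would find nothing here, while the process group ID still identifies the sleep. The new parent is PID 1 unless an ancestor has registered as a child subreaper.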

Observed in production: 183 out of 243 processes were orphans (~67 Chrome processes, ~49 dev server processes, ~56 esbuild processes), consuming 4.4 GB and growing toward a 10 GB limit.

Test plan

  • Run python -m pytest tests/test_process_utils.py -v — verifies process group kill works on children and grandchildren
  • Deploy to a container, run agents for several hours, and confirm the orphan count stays near zero:
    kubectl exec <pod> -- bash -c 'for f in /proc/[0-9]*/status; do ppid=$(grep "^PPid:" $f 2>/dev/null | awk "{print \$2}"); pid=$(grep "^Pid:" $f 2>/dev/null | awk "{print \$2}"); [ "$ppid" = "1" ] && [ "$pid" != "1" ] && echo $pid; done | wc -l'
  • Verify memory stays stable via cat /sys/fs/cgroup/memory.current
  • Verify Windows is unaffected (process group changes are Unix-only)

Fixes #164, fixes #197.

🤖 Generated with Claude Code

dangtrivan15 and others added 5 commits March 8, 2026 19:52
kill_process_tree() now uses os.killpg() when the subprocess was started
with start_new_session=True, killing the entire process group atomically
before falling back to psutil tree walk. This eliminates the race where
children get reparented to PID 1 before psutil.children() runs.

Addresses AutoForgeAI#164, AutoForgeAI#197.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each subprocess now runs in its own process group, enabling atomic
cleanup via os.killpg(). This prevents children from escaping
kill_process_tree() by getting reparented to PID 1.

6 spawn sites updated: 4 in parallel_orchestrator.py,
1 in process_manager.py, 1 in dev_server_manager.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
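The spawn-site change in this commit could look roughly like the sketch below. The wrapper spawn_in_own_group is a hypothetical name used for illustration; the PR itself passes start_new_session=True at each existing subprocess.Popen call.

```python
import os
import subprocess
import sys


def spawn_in_own_group(cmd, **popen_kwargs):
    """Start cmd as the leader of a new session and process group on Unix."""
    if os.name == "posix":
        # The child calls setsid() before exec, so its pgid equals its own pid.
        popen_kwargs.setdefault("start_new_session", True)
    return subprocess.Popen(cmd, **popen_kwargs)


if __name__ == "__main__" and os.name == "posix":
    proc = spawn_in_own_group([sys.executable, "-c", "import time; time.sleep(1)"])
    # Compare the child's group ID with its PID and with our own group ID.
    print(os.getpgid(proc.pid) == proc.pid, os.getpgid(proc.pid) != os.getpgrp())
    proc.wait()
```

One side effect to keep in mind: a child in its own session no longer receives a terminal Ctrl+C automatically, so the parent has to forward termination to the group itself.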
killProcess() now sends SIGTERM to -pid (the process group), ensuring
all child processes are terminated on Ctrl+C or SIGTERM. Uvicorn is
spawned with detached:true to create a new process group on Unix.
Also switches from execSync to execFileSync to avoid shell injection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scans every 60s for chrome/node/esbuild processes orphaned under PID 1
and kills them after a 30s grace period. Only active on Linux containers.
This is defense-in-depth: catches any orphans that escape the
process-group kill mechanism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
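A rough sketch of how such a reaper could work, under stated assumptions: the names find_orphans and OrphanReaper, the exact name list, and the choice of SIGKILL are all hypothetical and not taken from this PR. The scan reads /proc/<pid>/status, so it only works on Linux.

```python
import os
import signal
import time
from pathlib import Path

TARGET_NAMES = frozenset({"chrome", "node", "esbuild"})  # assumed match list


def find_orphans(proc_root="/proc", names=TARGET_NAMES):
    """Return {pid: name} for live matching processes whose parent is PID 1."""
    orphans = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            text = (entry / "status").read_text()
        except OSError:
            continue  # the process exited between listing and reading
        fields = {}
        for line in text.splitlines():
            key, _, value = line.partition(":")
            fields[key] = value.strip()
        pid = int(entry.name)
        is_zombie = fields.get("State", "").startswith("Z")  # signals cannot reap these
        if pid != 1 and fields.get("PPid") == "1" and not is_zombie:
            if fields.get("Name") in names:
                orphans[pid] = fields["Name"]
    return orphans


class OrphanReaper:
    """Kill matching orphans once they have been seen for grace_seconds."""

    def __init__(self, grace_seconds=30.0, scan=find_orphans,
                 kill=os.kill, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self._scan, self._kill, self._clock = scan, kill, clock
        self._first_seen = {}  # pid -> time the orphan was first observed

    def sweep(self):
        """Run one scan and return the PIDs that were signalled."""
        now = self._clock()
        current = self._scan()
        # Forget PIDs that are gone so a reused PID gets a fresh grace period.
        self._first_seen = {p: t for p, t in self._first_seen.items() if p in current}
        killed = []
        for pid in current:
            first = self._first_seen.setdefault(pid, now)
            if now - first >= self.grace_seconds:
                try:
                    self._kill(pid, signal.SIGKILL)
                except ProcessLookupError:
                    pass  # exited on its own since the scan
                else:
                    killed.append(pid)
                self._first_seen.pop(pid, None)
        return killed

    def run_forever(self, interval=60.0):
        """Blocking loop; run it in a background thread or task."""
        while True:
            self.sweep()
            time.sleep(interval)
```

With a 60s interval and a 30s grace period, an orphan is recorded on one sweep and killed on the next. Matching on the process name alone is coarse, so a real deployment would likely want a further check (for example on the command line) before killing anything that merely happens to be parented to PID 1.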
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dangtrivan15 added a commit to dangtrivan15/autoforge that referenced this pull request Mar 8, 2026
The orphan reaper kills live orphaned processes but cannot clean up
zombies (state Z) when the container PID 1 doesn't call waitpid().
Documents known container-side and code-side solutions.

Relates to AutoForgeAI#222.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>