fix: eliminate orphaned process leaks causing OOM kills #222
Open
dangtrivan15 wants to merge 5 commits into AutoForgeAI:master
kill_process_tree() now uses os.killpg() when the subprocess was started with start_new_session=True, killing the entire process group atomically before falling back to psutil tree walk. This eliminates the race where children get reparented to PID 1 before psutil.children() runs. Addresses AutoForgeAI#164, AutoForgeAI#197. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
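The group-kill path described in this commit can be sketched roughly as follows. This is a minimal illustration for Unix hosts, not the PR's code: `kill_process_group` and its `grace` parameter are hypothetical names, and the psutil tree-walk fallback mentioned above is left out.

```python
import os
import signal
import subprocess
import sys


def kill_process_group(proc: subprocess.Popen, grace: float = 3.0) -> None:
    """Signal every process in proc's group, escalating to SIGKILL if needed.

    Assumes proc was started with start_new_session=True, so the child is
    the leader of its own group and its PGID equals its PID.
    """
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)  # one syscall signals the whole group
    except ProcessLookupError:
        proc.poll()  # group already gone; just collect the exit status
        return
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(pgid, signal.SIGKILL)  # something ignored SIGTERM
        except ProcessLookupError:
            pass
        proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,
    )
    kill_process_group(child)
    print("exit status:", child.returncode)
```

Because the signal is addressed to the group rather than to a list of PIDs collected beforehand, a child that has already been reparented to PID 1 is still reached as long as it has not left the group.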
Each subprocess now runs in its own process group, enabling atomic cleanup via os.killpg(). This prevents children from escaping kill_process_tree() by getting reparented to PID 1. 6 spawn sites updated: 4 in parallel_orchestrator.py, 1 in process_manager.py, 1 in dev_server_manager.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
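To see what `start_new_session=True` changes at a spawn site, the sketch below starts a child and compares process group IDs. `spawn_in_own_group` is a hypothetical helper used only for illustration; the actual spawn sites in the three files are not shown here.

```python
import os
import subprocess
import sys
from typing import List


def spawn_in_own_group(cmd: List[str]) -> subprocess.Popen:
    """Start cmd as the leader of a new session and process group (POSIX only)."""
    return subprocess.Popen(cmd, start_new_session=True)


if __name__ == "__main__":
    proc = spawn_in_own_group([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        print("parent pgid:", os.getpgid(0))
        print("child pid:  ", proc.pid)
        # With a new session, the child's PGID equals its own PID.
        print("child pgid: ", os.getpgid(proc.pid))
    finally:
        proc.kill()
        proc.wait()
```

Anything the child later forks inherits that group ID by default, which is what makes a single `os.killpg()` call sufficient for cleanup.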
killProcess() now sends SIGTERM to -pid (the process group), ensuring all child processes are terminated on Ctrl+C or SIGTERM. Uvicorn is spawned with detached:true to create a new process group on Unix. Also switches from execSync to execFileSync to avoid shell injection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scans every 60s for chrome/node/esbuild processes orphaned under PID 1 and kills them after a 30s grace period. Only active on Linux containers. This is defense-in-depth: catches any orphans that escape the process-group kill mechanism. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
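A reaper of this shape could look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's implementation: it reads `/proc` directly instead of using psutil, all function names are hypothetical, and the name set, scan interval, and grace period are taken from the commit message above.

```python
import os
import signal
import time
from pathlib import Path
from typing import Dict, List, Optional, Set, Tuple

TARGET_NAMES = {"chrome", "node", "esbuild"}  # names from the commit message
SCAN_INTERVAL = 60.0   # seconds between scans
GRACE_SECONDS = 30.0   # how long an orphan may live before being killed


def read_stat(pid: int) -> Optional[Tuple[str, str, int]]:
    """Return (comm, state, ppid) from /proc/<pid>/stat, or None if unreadable."""
    try:
        raw = Path(f"/proc/{pid}/stat").read_text()
    except OSError:
        return None  # process exited between listing and reading
    # comm is parenthesised and may contain spaces, so split at the last ')'.
    head, _, tail = raw.rpartition(")")
    fields = tail.split()
    return head.split("(", 1)[1], fields[0], int(fields[1])


def find_orphans() -> Set[int]:
    """PIDs of live target processes whose parent is PID 1 (Linux only)."""
    orphans = set()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        info = read_stat(int(entry))
        if info and info[2] == 1 and info[1] != "Z" and info[0] in TARGET_NAMES:
            orphans.add(int(entry))
    return orphans


def expired(first_seen: Dict[int, float], current: Set[int], now: float,
            grace: float = GRACE_SECONDS) -> List[int]:
    """Update first_seen in place; return PIDs orphaned for at least grace seconds."""
    for pid in list(first_seen):
        if pid not in current:
            del first_seen[pid]  # process is gone or no longer an orphan
    for pid in current:
        first_seen.setdefault(pid, now)
    return sorted(pid for pid, seen in first_seen.items() if now - seen >= grace)


def reap_forever() -> None:
    first_seen: Dict[int, float] = {}
    while True:
        for pid in expired(first_seen, find_orphans(), time.monotonic()):
            try:
                os.kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # exited on its own since the scan
        time.sleep(SCAN_INTERVAL)
```

The grace period means a process has to be seen as an orphan on at least two scans before it is killed, which avoids shooting something that is only briefly parented to PID 1. If the application itself runs as PID 1, its own direct children would also match, so a real implementation would need an extra check.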
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dangtrivan15 added a commit to dangtrivan15/autoforge that referenced this pull request Mar 8, 2026
The orphan reaper kills live orphaned processes but cannot clean up zombies (state Z) when the container PID 1 doesn't call waitpid(). Documents known container-side and code-side solutions. Relates to AutoForgeAI#222. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
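The contents of that documentation are not shown here, but the code-side idea of calling `waitpid()` can be sketched as below. This is an illustrative assumption, not the commit's text: `reap_children` is a hypothetical name, and the loop only collects zombies that are children of the calling process, so it helps only when the caller is PID 1 or otherwise the parent of the dead processes.

```python
import os
from typing import List, Tuple


def reap_children() -> List[Tuple[int, int]]:
    """Collect every already-exited child without blocking.

    Returns a list of (pid, exit_code) pairs; exit_code is negative when
    the child was killed by a signal.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped
```

A PID 1 process would typically call something like this from a SIGCHLD handler or a periodic loop. On the container side, a common alternative is to run a small init such as tini as PID 1, for example via Docker's `--init` flag; whether the commit recommends that is not confirmed by the text above.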
Summary
- kill_process_tree() now uses os.killpg() when the subprocess was started with start_new_session=True, killing the entire process group atomically. This eliminates the TOCTOU race where children get reparented to PID 1 before psutil.children() runs.
- start_new_session=True on all 6 spawn sites: every subprocess.Popen call (orchestrator, agent manager, dev server manager) now creates a new process group on Unix.
- cli.js process group kill: killProcess() now sends SIGTERM to -pid (the process group) instead of a single PID. Uvicorn is spawned with detached: true.
Problem
After each agent session, dev server processes (node, npm, vite, esbuild) and Playwright browser instances (chrome, chrome_crashpad) survive as orphans. Over hours, these accumulate to hundreds of processes consuming gigabytes of memory, eventually hitting the container memory limit and triggering OOM kills.
Root cause: kill_process_tree() uses psutil.Process(pid).children(recursive=True) to find children, but between the parent dying and the tree walk executing, children get reparented to PID 1 and are no longer in the tree.
Observed in production: 183 out of 243 processes were orphans (~67 Chrome processes, ~49 dev server processes, ~56 esbuild processes), consuming 4.4 GB and growing toward a 10 GB limit.
Test plan
- python -m pytest tests/test_process_utils.py -v: verifies that the process group kill works on children and grandchildren
- cat /sys/fs/cgroup/memory.current
Fixes #164, fixes #197.
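The contents of tests/test_process_utils.py are not shown on this page. As a rough idea of what a child-and-grandchild check could look like, here is a hypothetical pytest-style test. It assumes a Unix host with `sh` and `sleep` available, and it detects that the whole group is dead by waiting for end-of-file on a pipe that every process in the group inherits, which works even if a killed process lingers as a zombie.

```python
import os
import select
import signal
import subprocess


def test_group_kill_reaches_grandchild() -> None:
    # sh is the child; the backgrounded subshell that execs sleep is the
    # grandchild. Both inherit the write end of the stdout pipe.
    proc = subprocess.Popen(
        ["sh", "-c", "(echo ready; exec sleep 60) & wait"],
        stdout=subprocess.PIPE,
        start_new_session=True,
    )
    try:
        # Block until the grandchild has actually started.
        assert proc.stdout.readline() == b"ready\n"

        os.killpg(proc.pid, signal.SIGTERM)
        proc.wait(timeout=5)

        # EOF arrives only once every holder of the write end has died.
        readable, _, _ = select.select([proc.stdout], [], [], 5)
        assert readable, "a process in the group still holds the pipe open"
        assert os.read(proc.stdout.fileno(), 1) == b""
    finally:
        if proc.poll() is None:
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()
        proc.stdout.close()
```

Waiting for the "ready" line before sending the signal matters: without it, the kill could land before the grandchild exists and the test would pass without exercising the case it is meant to cover.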
🤖 Generated with Claude Code