fix: clean up stale agents when process dies without clean disconnect #319
Merged
khaliqgant merged 5 commits into `main` on Jan 27, 2026
When an agent process dies ungracefully (e.g., OOM, SIGKILL), the socket doesn't close cleanly, leaving the agent in the router's agents map and the connected-agents.json file. This causes issues for CLI tools that rely on accurate agent lists.

Changes:
- Add forceRemoveAgent() to Router to clean up an agent without its connection
- Add removeStaleAgent() to Daemon to expose cleanup and update the JSON file
- Call removeStaleAgent() from the orchestrator's handleAgentCrash()

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
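The difference between the normal disconnect path and the new crash path can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `Connection` interface, the `register`/`unregister` methods, and the `persist` callback (standing in for the write to connected-agents.json) are all assumptions made for the example. Only the names `forceRemoveAgent()` and `removeStaleAgent()` come from the commit message.

```typescript
// Hypothetical shapes: the PR does not show the real Router or Connection types.
interface Connection {
  agentName: string;
  close(): void;
}

class Router {
  // Live agents keyed by name ("the router's agents map" in the commit message).
  readonly agents = new Map<string, Connection>();

  register(conn: Connection): void {
    this.agents.set(conn.agentName, conn);
  }

  // Clean-disconnect path: the caller still holds the connection object.
  unregister(conn: Connection): void {
    // Only remove the entry if this connection is still the registered one.
    if (this.agents.get(conn.agentName) === conn) {
      this.agents.delete(conn.agentName);
    }
  }

  // Crash path: look the agent up by name, so no connection object is needed.
  // Returns true if an entry was removed.
  forceRemoveAgent(name: string): boolean {
    const conn = this.agents.get(name);
    if (!conn) return false;
    try {
      conn.close(); // best effort; the peer process is already gone
    } catch {
      // A failure to close a dead socket is not actionable here.
    }
    return this.agents.delete(name);
  }
}

class Daemon {
  private readonly router: Router;
  // Assumed hook that rewrites the connected-agents file from a list of names.
  private readonly persist: (names: string[]) => void;

  constructor(router: Router, persist: (names: string[]) => void) {
    this.router = router;
    this.persist = persist;
  }

  // Remove the agent from the router, then persist the new list if it changed.
  removeStaleAgent(name: string): boolean {
    const removed = this.router.forceRemoveAgent(name);
    if (removed) this.persist(Array.from(this.router.agents.keys()));
    return removed;
  }
}
```

Persisting only when something was actually removed keeps repeated crash reports for the same agent from rewriting the file needlessly.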
Adds new APIs to help manage stale agents:

**SDK (client.ts):**
- `listConnectedAgents()` - Returns only currently connected agents, not historical/registered ones. Use for accurate liveness checks.
- `removeAgent(name, opts)` - Removes an agent from the registry (agents.json, sessions table). Optionally removes message history.

**Daemon (server.ts):**
- LIST_CONNECTED_AGENTS handler - Returns agents from router.agents (live)
- REMOVE_AGENT handler - Cleans up registry, storage, and router

**MCP (tools):**
- `relay_connected` tool - List only currently connected agents
- `relay_remove_agent` tool - Clean up stale agents from registry

**Protocol:**
- Added LIST_CONNECTED_AGENTS, LIST_CONNECTED_AGENTS_RESPONSE message types
- Added REMOVE_AGENT, REMOVE_AGENT_RESPONSE message types

**Storage:**
- Added `removeAgent()` method to delete from sessions table
- Added `removeMessagesForAgent()` method to clean message history

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
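To make the request/response pairing concrete, here is a rough sketch of how the four message types might be modeled and handled. The message type names are taken from the commit message; everything else (the payload fields, the `DaemonState` stand-in, the `removeMessages` flag, and the `handleRequest` function) is hypothetical and chosen only for illustration.

```typescript
// Message type names come from the PR; every payload field is an assumption.
type DaemonRequest =
  | { type: "LIST_CONNECTED_AGENTS" }
  | { type: "REMOVE_AGENT"; name: string; removeMessages?: boolean };

type DaemonResponse =
  | { type: "LIST_CONNECTED_AGENTS_RESPONSE"; agents: string[] }
  | { type: "REMOVE_AGENT_RESPONSE"; name: string; removed: boolean };

// Hypothetical in-memory stand-ins for the router, registry, and storage.
interface DaemonState {
  live: Set<string>; // currently connected agents
  registered: Set<string>; // every agent ever registered
  messages: Map<string, string[]>; // message history per agent
}

function handleRequest(state: DaemonState, req: DaemonRequest): DaemonResponse {
  switch (req.type) {
    case "LIST_CONNECTED_AGENTS":
      // Answer from the live set only, never from the historical registry.
      return {
        type: "LIST_CONNECTED_AGENTS_RESPONSE",
        agents: Array.from(state.live).sort(),
      };
    case "REMOVE_AGENT": {
      // Clean up both the live set and the registry; history is opt-in.
      const wasLive = state.live.delete(req.name);
      const wasRegistered = state.registered.delete(req.name);
      if (req.removeMessages) state.messages.delete(req.name);
      return {
        type: "REMOVE_AGENT_RESPONSE",
        name: req.name,
        removed: wasLive || wasRegistered,
      };
    }
  }
}
```

Modeling the messages as a discriminated union lets the compiler check that every request type has a handler branch.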
Summary
- Add `forceRemoveAgent()` to Router to clean up an agent without needing the connection object
- Add `removeStaleAgent()` to Daemon to expose cleanup and update `connected-agents.json`
- Call `removeStaleAgent()` from the orchestrator's `handleAgentCrash()` when a dead PID is detected

Problem
When an agent process dies ungracefully (e.g., OOM kill, SIGKILL, crash), the socket doesn't close cleanly. This leaves the agent in:
- the router's `agents` Map
- the `connected-agents.json` file

CLI tools like `listAgents` then return stale data, causing issues for orchestrators that rely on accurate agent lists.

Solution
The orchestrator's health monitoring already detects dead PIDs every 10 seconds via `checkAgentHeartbeats()`. This PR adds cleanup logic to `handleAgentCrash()` to:
- remove the agent from the router
- update `connected-agents.json` immediately

Test plan
- Kill an agent process with `kill -9 <pid>`
- Verify the agent is removed from `connected-agents.json` within ~10 seconds
- Verify the `listAgents` MCP call returns an accurate list

🤖 Generated with Claude Code
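
For readers unfamiliar with dead-PID detection, the sketch below shows one common way a Node.js process can check whether a PID still exists: sending signal 0 with `process.kill`. Whether `checkAgentHeartbeats()` actually works this way is not stated in the PR, so treat this as an assumption. `isPidAlive`, `TrackedAgent`, and `findCrashedAgents` are hypothetical names introduced for the example.

```typescript
// Returns true if a process with this PID exists. Signal 0 performs the
// existence and permission check without delivering any signal.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we are not allowed to signal it.
    // Any other error (typically ESRCH) is treated as "no such process".
    return (err as { code?: string }).code === "EPERM";
  }
}

// Hypothetical record the orchestrator might keep per spawned agent.
interface TrackedAgent {
  name: string;
  pid: number;
}

// Returns the agents whose process no longer exists. The liveness check is
// injectable so the function can be exercised without killing real processes.
function findCrashedAgents(
  agents: TrackedAgent[],
  alive: (pid: number) => boolean = isPidAlive,
): TrackedAgent[] {
  return agents.filter((agent) => !alive(agent.pid));
}
```

An orchestrator could run `findCrashedAgents` on a 10-second timer and pass each result to its crash handler, which matches the "within ~10 seconds" expectation in the test plan.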