fix: clean up stale agents when process dies without clean disconnect #319

Merged
khaliqgant merged 5 commits into main from fix/stale-agent-cleanup
Jan 27, 2026

@khaliqgant (Member) commented Jan 27, 2026

Summary

  • Adds forceRemoveAgent() to Router to remove an agent without needing its connection object
  • Adds removeStaleAgent() to Daemon to expose that cleanup and update connected-agents.json
  • Calls removeStaleAgent() from the orchestrator's handleAgentCrash() when a dead PID is detected
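
As a rough illustration of the first two bullets, here is a minimal sketch of how the two methods could fit together. The class shapes, the `AgentConnection` type, and the `writeConnectedAgents` hook (standing in for the connected-agents.json write) are assumptions for illustration; only the method names `forceRemoveAgent()` and `removeStaleAgent()` come from this PR.

```typescript
// Hypothetical shapes: the real Router and Daemon classes are not shown in this PR text.
interface AgentConnection {
  agentName: string;
}

class Router {
  // Live agents keyed by name.
  readonly agents = new Map<string, AgentConnection>();

  register(conn: AgentConnection): void {
    this.agents.set(conn.agentName, conn);
  }

  // Remove by name only, so a dead socket's connection object is not needed.
  // Returns true if an entry was actually removed.
  forceRemoveAgent(name: string): boolean {
    return this.agents.delete(name);
  }
}

class Daemon {
  private readonly router: Router;
  // Assumed hook standing in for rewriting connected-agents.json.
  private readonly writeConnectedAgents: (names: string[]) => void;

  constructor(router: Router, writeConnectedAgents: (names: string[]) => void) {
    this.router = router;
    this.writeConnectedAgents = writeConnectedAgents;
  }

  // Drop the agent from the router and persist the new list right away.
  removeStaleAgent(name: string): boolean {
    const removed = this.router.forceRemoveAgent(name);
    if (removed) {
      this.writeConnectedAgents(Array.from(this.router.agents.keys()));
    }
    return removed;
  }
}
```

In this sketch the file is only rewritten when something was actually removed, so repeated calls for the same name are harmless.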

Problem

When an agent process dies ungracefully (e.g., OOM kill, SIGKILL, crash), the socket doesn't close cleanly. This leaves the agent in:

  • The router's agents Map
  • The connected-agents.json file

CLI tools like listAgents then return stale data, causing issues for orchestrators that rely on accurate agent lists.

Solution

The orchestrator's health monitoring already detects dead PIDs every 10 seconds via checkAgentHeartbeats(). This PR adds cleanup logic to handleAgentCrash() to:

  1. Remove the agent from the router's internal maps
  2. Update connected-agents.json immediately
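
The flow above could look roughly like the following sketch. The `TrackedAgent` record, the injected `isAlive` probe, and the injected `removeStaleAgent` callback are assumptions; the PR text does not say how the orchestrator detects a dead PID (in Node.js, one common choice is `process.kill(pid, 0)` wrapped in a try/catch).

```typescript
// Hypothetical record of a spawned agent; field names are assumptions.
interface TrackedAgent {
  name: string;
  pid: number;
}

type PidProbe = (pid: number) => boolean;

class Orchestrator {
  private readonly tracked = new Map<string, TrackedAgent>();
  private readonly isAlive: PidProbe;
  private readonly removeStaleAgent: (name: string) => void;

  constructor(isAlive: PidProbe, removeStaleAgent: (name: string) => void) {
    this.isAlive = isAlive;
    this.removeStaleAgent = removeStaleAgent;
  }

  track(agent: TrackedAgent): void {
    this.tracked.set(agent.name, agent);
  }

  // One pass of the periodic check (the PR describes a 10-second interval).
  // Returns the names of agents whose PID was found dead on this pass.
  checkAgentHeartbeats(): string[] {
    const crashed: string[] = [];
    // Copy the values first so handleAgentCrash can delete while we iterate.
    for (const agent of Array.from(this.tracked.values())) {
      if (!this.isAlive(agent.pid)) {
        this.handleAgentCrash(agent);
        crashed.push(agent.name);
      }
    }
    return crashed;
  }

  // Stop tracking the agent, then ask the daemon to clean up router state
  // and connected-agents.json.
  private handleAgentCrash(agent: TrackedAgent): void {
    this.tracked.delete(agent.name);
    this.removeStaleAgent(agent.name);
  }
}
```

Injecting the probe keeps the sketch testable without spawning real processes; a production version would presumably run `checkAgentHeartbeats()` on a timer.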

Test plan

  • Kill an agent process with kill -9 <pid>
  • Verify agent is removed from connected-agents.json within ~10 seconds
  • Verify listAgents MCP call returns accurate list

🤖 Generated with Claude Code


When an agent process dies ungracefully (e.g., OOM, SIGKILL), the socket
doesn't close cleanly, leaving the agent in the router's agents map and
connected-agents.json file. This causes issues for CLI tools that rely
on accurate agent lists.

Changes:
- Add forceRemoveAgent() to Router to clean up agent without connection
- Add removeStaleAgent() to Daemon to expose cleanup and update JSON file
- Call removeStaleAgent() from orchestrator's handleAgentCrash()

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@devin-ai-integration (Bot, Contributor) left a comment

Devin Review found 1 potential issue.

View issue and 4 additional flags in Devin Review.

Comment thread packages/daemon/src/router.ts

@devin-ai-integration (Bot, Contributor) left a comment

Devin Review found 1 new potential issue.

View issue and 7 additional flags in Devin Review.

Comment thread packages/daemon/src/server.ts

Adds new APIs to help manage stale agents:

**SDK (client.ts):**
- `listConnectedAgents()` - Returns only currently connected agents, not
  historical/registered ones. Use for accurate liveness checks.
- `removeAgent(name, opts)` - Removes an agent from the registry (agents.json,
  sessions table). Optionally removes message history.

**Daemon (server.ts):**
- LIST_CONNECTED_AGENTS handler - Returns agents from router.agents (live)
- REMOVE_AGENT handler - Cleans up registry, storage, and router

**MCP (tools):**
- `relay_connected` tool - List only currently connected agents
- `relay_remove_agent` tool - Clean up stale agents from registry

**Protocol:**
- Added LIST_CONNECTED_AGENTS, LIST_CONNECTED_AGENTS_RESPONSE message types
- Added REMOVE_AGENT, REMOVE_AGENT_RESPONSE message types
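
For readers unfamiliar with the request/response pairing, the four message types might be modeled along these lines. This is a sketch under assumptions: the payload fields (`name`, `agents`, `removed`) and the `DaemonState` shape are invented for illustration, and only the four type names come from this commit.

```typescript
// Payload fields are assumptions; the commit names only the message types.
type RelayRequest =
  | { type: "LIST_CONNECTED_AGENTS" }
  | { type: "REMOVE_AGENT"; name: string };

type RelayResponse =
  | { type: "LIST_CONNECTED_AGENTS_RESPONSE"; agents: string[] }
  | { type: "REMOVE_AGENT_RESPONSE"; name: string; removed: boolean };

// Assumed daemon-side state for the sketch.
interface DaemonState {
  liveAgents: Map<string, unknown>; // stand-in for router.agents
  registry: Set<string>; // stand-in for agents.json
}

function handleRequest(req: RelayRequest, state: DaemonState): RelayResponse {
  switch (req.type) {
    case "LIST_CONNECTED_AGENTS":
      // Only live connections are reported, never historical registry entries.
      return {
        type: "LIST_CONNECTED_AGENTS_RESPONSE",
        agents: Array.from(state.liveAgents.keys()),
      };
    case "REMOVE_AGENT": {
      // Clean up both places so neither keeps a stale entry.
      const inRegistry = state.registry.delete(req.name);
      const inRouter = state.liveAgents.delete(req.name);
      return {
        type: "REMOVE_AGENT_RESPONSE",
        name: req.name,
        removed: inRegistry || inRouter,
      };
    }
  }
}
```

A discriminated union on `type` lets the compiler check that every request type has a handler branch.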

**Storage:**
- Added `removeAgent()` method to delete from sessions table
- Added `removeMessagesForAgent()` method to clean message history
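
To make the two storage methods concrete, here is an in-memory sketch. The real storage layer and its schema are not shown in this commit, so the row shapes, the return values, and the choice to delete messages in both directions (sent by or addressed to the agent) are all assumptions.

```typescript
// Hypothetical row shapes; the real schema is not shown in the commit.
interface SessionRow {
  agentName: string;
}

interface MessageRow {
  id: number;
  from: string;
  to: string;
}

class InMemoryStorage {
  private sessions: SessionRow[] = [];
  private messages: MessageRow[] = [];

  addSession(agentName: string): void {
    this.sessions.push({ agentName });
  }

  addMessage(msg: MessageRow): void {
    this.messages.push(msg);
  }

  // Delete the agent's session rows; returns how many rows were deleted.
  removeAgent(name: string): number {
    const before = this.sessions.length;
    this.sessions = this.sessions.filter((s) => s.agentName !== name);
    return before - this.sessions.length;
  }

  // Delete messages sent by or addressed to the agent (assumed semantics);
  // returns how many messages were deleted.
  removeMessagesForAgent(name: string): number {
    const before = this.messages.length;
    this.messages = this.messages.filter((m) => m.from !== name && m.to !== name);
    return before - this.messages.length;
  }

  messageCount(): number {
    return this.messages.length;
  }
}
```

Keeping the two methods separate matches the commit's description that message history removal is optional.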

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@devin-ai-integration (Bot, Contributor) left a comment

Devin Review found 1 new potential issue.

View issue and 11 additional flags in Devin Review.

Comment thread packages/daemon/src/server.ts

@devin-ai-integration (Bot, Contributor) left a comment

Devin Review found 1 new potential issue.

View issue and 15 additional flags in Devin Review.

Comment thread packages/daemon/src/router.test.ts Outdated
@khaliqgant khaliqgant merged commit a03fab5 into main Jan 27, 2026
21 checks passed
@khaliqgant khaliqgant deleted the fix/stale-agent-cleanup branch January 27, 2026 10:29