fix(broker): self-heal wedged brokers and quiet shutdown-race noise#67

Merged: khaliqgant merged 3 commits into main from fix/broker-wedge-recovery-and-shutdown-noise on May 31, 2026

@khaliqgant
Member

Why

Two unrelated issues showed up together in the logs:

[broker] PTY input stream failed for Lead2; falling back to HTTP input: AgentRelayProtocolError: PTY input stream closed
Error occurred in handler for 'broker:list-agents': [DOMException [TimeoutError]: The operation was aborted due to timeout]
Error occurred in handler for 'broker:list-agents': [DOMException [TimeoutError]: The operation was aborted due to timeout]
...

1. Wedged broker never recovered (the repeating list-agents timeouts)

listAgents → request('/api/spawned') uses the SDK transport's AbortSignal.timeout(30_000). A full 30s timeout (vs. an instant ECONNREFUSED) means the broker process is alive and accepting the TCP connection but never answering: it is wedged, not dead.

The in-progress revive logic (isBrokerUnreachableError) only matched ECONNREFUSED/ECONNRESET, so a TimeoutError DOMException slipped past it — nothing ever restarted the wedged broker, and every poll (syncBrokerSnapshot fires on broker events, on connect, and on mount) re-spammed Error occurred in handler for 'broker:list-agents' on the main side (Electron logs the rejected handler even though the renderer swallows it).
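
To make the gap concrete, here is a minimal sketch of a matcher that only recognizes refused/reset connections. The name `matchesRefusedOrReset` and both error shapes are illustrative assumptions, not the code in src/main/broker.ts; the timeout error simply mimics the name and message of the DOMException in the log above.

```typescript
// Hypothetical stand-in for a pre-fix check: only connection-level codes match.
type ErrorNode = { code?: unknown; cause?: unknown }

function matchesRefusedOrReset(err: unknown): boolean {
  let node: unknown = err
  // Walk err -> err.cause -> ... with a depth cap so a cyclic chain cannot loop forever.
  for (let depth = 0; depth < 8 && typeof node === 'object' && node !== null; depth++) {
    const { code, cause } = node as ErrorNode
    if (code === 'ECONNREFUSED' || code === 'ECONNRESET') return true
    node = cause
  }
  return false
}

// A dead broker: a wrapper error whose cause carries the socket error code.
const refused = Object.assign(new TypeError('fetch failed'), {
  cause: Object.assign(new Error('connect ECONNREFUSED'), { code: 'ECONNREFUSED' })
})

// A wedged broker: the request is aborted by the 30s timeout instead.
const timedOut = Object.assign(new Error('The operation was aborted due to timeout'), {
  name: 'TimeoutError'
})

console.log(matchesRefusedOrReset(refused)) // true: the revive path runs
console.log(matchesRefusedOrReset(timedOut)) // false: nothing restarts the broker
```

The second result is the failure mode described above: the error is real, but no code or name the matcher knows about appears anywhere in the chain.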

2. Misleading shutdown-race warning

The PTY input stream failed … falling back to HTTP input line bottoms out in shutdownBrokerOnce → shutdown → closeInputStreamsForProject → closeInputStream → PtyInputStream.close. On app quit, closing the stream rejects any in-flight send() with input_stream_closed, which lands in sendInput's catch and logs as a failure — but the whole session is being destroyed, so the "fallback" is meaningless noise.
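
A guard for this case could look roughly like the sketch below. It assumes the rejection carries `code === 'input_stream_closed'`; the real AgentRelayProtocolError shape is not shown in this PR, and the helper names `isInputStreamClosedSketch` and `ptyFailureLogLine` are hypothetical.

```typescript
// Assumption: the rejection exposes its reason as a `code` property.
interface CodedError { code?: unknown }

function isInputStreamClosedSketch(err: unknown): boolean {
  return typeof err === 'object' && err !== null && (err as CodedError).code === 'input_stream_closed'
}

// Decide what, if anything, to log before falling back to HTTP input.
function ptyFailureLogLine(agent: string, err: unknown): string | null {
  if (isInputStreamClosedSketch(err)) return null // expected close: stay quiet
  const detail = err instanceof Error ? err.message : String(err)
  return `[broker] PTY input stream failed for ${agent}; falling back to HTTP input: ${detail}`
}

const expectedClose = Object.assign(new Error('PTY input stream closed'), {
  code: 'input_stream_closed'
})
const realFailure = new Error('socket hang up')

console.log(ptyFailureLogLine('Lead2', expectedClose)) // null
console.log(ptyFailureLogLine('Lead2', realFailure)) // warning line for a genuine transport failure
```

Either way the caller still falls back to HTTP; the guard only controls whether the fallback is announced.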

What changed (src/main/broker.ts)

  • Refactored the cause-chain walk into a shared someInCauseChain() helper.
  • Added isBrokerTimeoutError() (matches TimeoutError/AbortError, ETIMEDOUT, and the "aborted due to timeout" message through the cause chain).
  • listAgents now distinguishes failure modes:
    • Connection refused → broker is dead → respawn immediately (unchanged).
    • Timeout → could be a one-off slow response → count consecutive timeouts per project and respawn only after MAX_BROKER_TIMEOUTS_BEFORE_REVIVE = 2. Below the threshold it rethrows so the renderer keeps its stale agent list instead of flickering to empty.
  • Timeout counter resets on any successful poll (collectSessionAgents), when a revive is triggered, and on shutdown.
  • Added isInputStreamClosedError() and guarded the sendInput warning so an expected close (shutdown / project-or-agent close / terminal re-attach) falls through to HTTP silently. Real transport failures still warn.
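
For readers who want the shape of the cause-chain walk, here is a minimal sketch. The body of someInCauseChain() is not shown in this PR, so `someInCauseChainSketch` and `isTimeoutSketch` are assumptions about how such a helper might be written; the predicate leaves out AbortError, which the review thread further down argues is too broad.

```typescript
// Fields the predicates inspect on each link of the chain.
interface CauseNode { code?: unknown; name?: unknown; message?: unknown; cause?: unknown }

// Return true if any error in err -> cause -> cause ... satisfies the predicate.
function someInCauseChainSketch(err: unknown, predicate: (node: CauseNode) => boolean): boolean {
  const seen = new Set<unknown>() // guards against cyclic cause references
  let current: unknown = err
  while (typeof current === 'object' && current !== null && !seen.has(current)) {
    seen.add(current)
    const node = current as CauseNode
    if (predicate(node)) return true
    current = node.cause
  }
  return false
}

function isTimeoutSketch(err: unknown): boolean {
  return someInCauseChainSketch(err, (node) => {
    if (node.code === 'ETIMEDOUT' || node.name === 'TimeoutError') return true
    return typeof node.message === 'string' && /aborted due to timeout/i.test(node.message)
  })
}

// The timeout is buried one level down, as it would be when a transport wraps it.
const inner = Object.assign(new Error('The operation was aborted due to timeout'), {
  name: 'TimeoutError'
})
const wrapped = Object.assign(new Error('request failed'), { cause: inner })

console.log(isTimeoutSketch(wrapped)) // true
console.log(isTimeoutSketch(new Error('boom'))) // false
```

Sharing one walker means each detector only has to describe what a matching node looks like, not how to traverse the chain.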

Net effect: a wedged broker self-restarts on a fresh port within ~2 poll cycles and the handler-error spam stops on its own, instead of repeating every 30s forever; app-quit logs no longer carry the spurious PTY warning.
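
The counting rule is small enough to sketch in isolation. `TimeoutTracker` below is a hypothetical stand-in for the per-project counter, with the threshold copied from MAX_BROKER_TIMEOUTS_BEFORE_REVIVE = 2; it is not the BrokerManager code itself.

```typescript
const MAX_TIMEOUTS_BEFORE_REVIVE = 2 // mirrors MAX_BROKER_TIMEOUTS_BEFORE_REVIVE

class TimeoutTracker {
  private readonly counts = new Map<string, number>()

  // Record one timeout; returns true once the project has hit the threshold.
  recordTimeout(projectId: string): boolean {
    const next = (this.counts.get(projectId) ?? 0) + 1
    this.counts.set(projectId, next)
    return next >= MAX_TIMEOUTS_BEFORE_REVIVE
  }

  // Called on a successful poll, when a revive is triggered, and on shutdown.
  reset(projectId: string): void {
    this.counts.delete(projectId)
  }
}

const tracker = new TimeoutTracker()
console.log(tracker.recordTimeout('proj-a')) // false: first timeout, rethrow and keep the stale list
console.log(tracker.recordTimeout('proj-a')) // true: second consecutive timeout, revive
tracker.reset('proj-a')
console.log(tracker.recordTimeout('proj-a')) // false: the streak starts over
```

Since each timed-out request runs for the full 30s, reaching the threshold takes at least two such requests for the same project with no successful poll in between.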

Notes

  • Left the SDK's global 30s requestTimeoutMs alone — shortening it would speed wedge detection but is transport-wide and would clip legitimately slow ops (spawn, etc.).
  • electron-vite build passes. No broker unit tests exist in src/main/__tests__/.

🤖 Generated with Claude Code

Two unrelated issues surfaced together in the logs:

1. `broker:list-agents` timed out every 30s forever. The in-progress
   revive logic only matched ECONNREFUSED/ECONNRESET, but a *wedged*
   broker (alive, accepting TCP, never answering HTTP) produces a
   TimeoutError DOMException, which slipped past it — so nothing ever
   recovered it and every poll re-spammed "Error occurred in handler".

   listAgents now distinguishes the two failure modes: connection
   refused = dead, respawn immediately; timeout = possibly slow, count
   consecutive timeouts per project and respawn after MAX (2). Below the
   threshold it rethrows so the renderer keeps its stale agent list
   instead of flickering to empty. The counter resets on any successful
   poll, on revive, and on shutdown.

2. "PTY input stream failed for X; falling back to HTTP input" was
   logged on every app quit. An `input_stream_closed` rejection only
   happens when we deliberately tear the stream down (shutdown, project
   /agent close, re-attach) while a keystroke is in flight — an expected
   close, not a transport failure. Fall through to HTTP silently for
   that code; real failures still warn.

Also refactored the cause-chain walk into a shared someInCauseChain()
helper used by both the unreachable and timeout detectors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 31, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between c7837eb and ba16852.

📒 Files selected for processing (1)
  • src/main/broker.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/main/broker.ts

📝 Walkthrough

Adds broker error classification and per-project revival (dedupe + timeout counting), safe process termination and restart (reviveSession), integrates revive behavior into listAgents with retry/degrade logic, silences an expected PTY WS race in sendInput, and clears timeout state on session drop.

Changes

Broker Session Revival with Error Classification

| Layer / File(s) | Summary |
| --- | --- |
| Revive constants and process utilities (src/main/broker.ts) | Adds MAX_BROKER_TIMEOUTS_BEFORE_REVIVE, BROKER_REVIVE_TERM_GRACE_MS, and process-management helpers (isProcessAlive, waitForProcessExit, terminateOwnedBrokerProcess). |
| Error classification predicates (src/main/broker.ts) | Adds someInCauseChain and predicates: isBrokerUnreachableError, isBrokerTimeoutError, and isInputStreamClosedError. |
| BrokerManager revival state (src/main/broker.ts) | Adds revivePromises map to dedupe per-project revives and brokerTimeoutCounts map to track consecutive listAgents timeouts. |
| Session revival and synchronization (src/main/broker.ts) | reviveSession(projectId) restarts an owned local broker: skips cloud sessions, dedupes, drops the old session, terminates the owned PID, calls start(...), and returns success/failure. Adds getOrAwaitSession to await in-flight revive/start promises. |
| ListAgents revive orchestration (src/main/broker.ts) | listAgents centralizes polling into collectSessionAgents (clearing brokerTimeoutCounts on success), classifies failures to decide immediate revive, threshold-gated revive after consecutive timeouts, or rethrow; retries once after revive and returns an empty agent list for unrecoverable project sessions. |
| Supporting error handling and cleanup (src/main/broker.ts) | sendInput suppresses warnings for expected input_stream_closed PTY WS races while falling back to HTTP. shutdown now delegates per-project cleanup to dropSession, which clears brokerTimeoutCounts and removes session/agent mappings. |
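
The per-project dedupe described for revivePromises can be sketched as a map of in-flight promises. `ReviveDeduper` and its `restart` callback are hypothetical names for illustration; the real reviveSession also drops the session and terminates the old process, which this sketch leaves out.

```typescript
// Concurrent callers for the same project share one in-flight promise.
class ReviveDeduper {
  private readonly inFlight = new Map<string, Promise<boolean>>()
  private readonly restart: (projectId: string) => Promise<boolean>
  restarts = 0

  constructor(restart: (projectId: string) => Promise<boolean>) {
    this.restart = restart
  }

  revive(projectId: string): Promise<boolean> {
    const existing = this.inFlight.get(projectId)
    if (existing) return existing

    this.restarts += 1
    const promise: Promise<boolean> = this.restart(projectId)
      .catch(() => false) // a failed restart resolves to false instead of rejecting
      .finally(() => {
        // Only clear the entry if it is still ours; a shutdown may have removed it.
        if (this.inFlight.get(projectId) === promise) this.inFlight.delete(projectId)
      })
    this.inFlight.set(projectId, promise)
    return promise
  }
}

const deduper = new ReviveDeduper(async () => true)
const first = deduper.revive('proj-a')
const second = deduper.revive('proj-a')
console.log(first === second, deduper.restarts) // true 1
```

Returning the stored promise rather than starting a second restart is what keeps several simultaneous polls from spawning several brokers for one project.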

Sequence Diagram

sequenceDiagram
  participant Client
  participant ListAgents
  participant BrokerManager
  participant ErrorClassifier
  participant BrokerSession
  participant Broker

  Client->>ListAgents: listAgents(projectId)
  ListAgents->>BrokerSession: collectSessionAgents()
  BrokerSession->>Broker: request agents
  Broker--xBrokerSession: connection error / timeout
  BrokerSession--xListAgents: bubbled error

  ListAgents->>ErrorClassifier: classify error
  ErrorClassifier-->>ListAgents: unreachable | timeout | other

  alt Unreachable
    ListAgents->>BrokerManager: reviveSession(projectId)
    BrokerManager->>BrokerSession: dropSession + terminateOwnedBrokerProcess()
    BrokerManager->>BrokerSession: start(new port)
    BrokerManager-->>ListAgents: revived?
    ListAgents->>BrokerSession: retry collectSessionAgents()
  else Timeout (below threshold)
    ListAgents->>BrokerManager: increment brokerTimeoutCounts
    ListAgents-->>Client: return other projects' agents
  else Timeout (at/above threshold)
    ListAgents->>BrokerManager: reviveSession(projectId)
  end

  alt Revive succeeds
    BrokerManager->>ListAgents: clear brokerTimeoutCounts
    ListAgents-->>Client: return agents
  else Revive fails
    ListAgents-->>Client: return empty agents for project
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nudge the broker, sniff the cause,
I hop through stacks and silent flaws.
When timeouts stack and sockets close,
I thump, restart, and off it goes—
A tiny rabbit patching rows.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces self-healing capabilities for local broker processes by detecting unreachable or wedged brokers and automatically restarting them on fresh ports. The review feedback highlights several critical issues to address: refining isBrokerTimeoutError to avoid false positives from standard manual aborts, introducing an asynchronous helper to await active start or revive promises to prevent transient errors during concurrent listAgents calls, and ensuring active revive promises are cleaned up during shutdown to avoid race conditions with manual shutdowns.

Comment thread src/main/broker.ts
Comment on lines +276 to +283
function isBrokerTimeoutError(err: unknown): boolean {
  return someInCauseChain(err, (node) => {
    if (node.code === 'ETIMEDOUT') return true
    if (node.name === 'TimeoutError' || node.name === 'AbortError') return true
    const message = node.message
    return typeof message === 'string' && /aborted due to timeout|operation was aborted|ETIMEDOUT/i.test(message)
  })
}

high

The current implementation of isBrokerTimeoutError incorrectly matches standard manual aborts (such as those triggered by component unmounts or user navigation) because it checks for node.name === 'AbortError' and matches the generic message "operation was aborted". This can lead to false positive timeout counts and spurious broker restarts when navigating the app. It should be refined to only match actual timeout errors (TimeoutError, ETIMEDOUT, or messages explicitly mentioning timeout).

Suggested change
- function isBrokerTimeoutError(err: unknown): boolean {
-   return someInCauseChain(err, (node) => {
-     if (node.code === 'ETIMEDOUT') return true
-     if (node.name === 'TimeoutError' || node.name === 'AbortError') return true
-     const message = node.message
-     return typeof message === 'string' && /aborted due to timeout|operation was aborted|ETIMEDOUT/i.test(message)
-   })
- }
+ function isBrokerTimeoutError(err: unknown): boolean {
+   return someInCauseChain(err, (node) => {
+     if (node.code === 'ETIMEDOUT') return true
+     if (node.name === 'TimeoutError') return true
+     const message = node.message
+     return typeof message === 'string' && /aborted due to timeout|ETIMEDOUT|timed?out/i.test(message)
+   })
+ }

Comment thread src/main/broker.ts
Comment on lines +831 to +861
private async reviveSession(projectId: string): Promise<boolean> {
const existing = this.revivePromises.get(projectId)
if (existing) return existing

const session = this.sessions.get(projectId)
if (!session) return false
// Cloud sessions can't be re-spawned locally — they live in a remote
// sandbox and are owned by CloudAgentManager.
if (session.cloudSandboxId) return false
const win = session.window
if (!win || win.isDestroyed()) return false

const { cwd, name, channels } = session
const promise = (async () => {
console.warn(`[broker] Broker for project ${projectId} is unreachable; restarting on a fresh port`)
await this.shutdown(projectId)
await this.start(projectId, cwd, name, win, channels)
return this.sessions.has(projectId)
})()
this.revivePromises.set(projectId, promise)
try {
return await promise
} catch (err) {
console.error(`[broker] Failed to revive broker for project ${projectId}:`, toErrorMessage(err))
return false
} finally {
if (this.revivePromises.get(projectId) === promise) {
this.revivePromises.delete(projectId)
}
}
}

high

When a broker is being revived, this.sessions.delete(projectId) is called during shutdown. Any concurrent scoped listAgents(projectId) call will immediately throw "Relay workspace not started — select the project first" because getSessionForProject is synchronous and throws if the session is missing. To prevent these transient errors during background self-healing, we should introduce an asynchronous getOrAwaitSession helper that awaits any active revivePromises or startPromises before returning the session.

Additionally, we should guard against a resurrection race condition where a manual shutdown is overridden by an active background revive by checking if the revive promise is still active before calling this.start.

  private async reviveSession(projectId: string): Promise<boolean> {
    const existing = this.revivePromises.get(projectId)
    if (existing) return existing

    const session = this.sessions.get(projectId)
    if (!session) return false
    // Cloud sessions can't be re-spawned locally — they live in a remote
    // sandbox and are owned by CloudAgentManager.
    if (session.cloudSandboxId) return false
    const win = session.window
    if (!win || win.isDestroyed()) return false

    const { cwd, name, channels } = session
    const promise = (async () => {
      console.warn(`[broker] Broker for project ${projectId} is unreachable; restarting on a fresh port`)
      await this.shutdown(projectId)
      if (this.revivePromises.get(projectId) !== promise) return false
      await this.start(projectId, cwd, name, win, channels)
      return this.sessions.has(projectId)
    })()
    this.revivePromises.set(projectId, promise)
    try {
      return await promise
    } catch (err) {
      console.error(`[broker] Failed to revive broker for project ${projectId}:`, toErrorMessage(err))
      return false
    } finally {
      if (this.revivePromises.get(projectId) === promise) {
        this.revivePromises.delete(projectId)
      }
    }
  }

  private async getOrAwaitSession(projectId: string): Promise<BrokerSession> {
    const normalizedProjectId = projectId.trim()
    const revivePromise = this.revivePromises.get(normalizedProjectId)
    if (revivePromise) {
      await revivePromise.catch(() => undefined)
    }
    const startPromise = this.startPromises.get(normalizedProjectId)
    if (startPromise) {
      await startPromise.catch(() => undefined)
    }
    const session = this.sessions.get(normalizedProjectId)
    if (!session) {
      throw new Error('Relay workspace not started — select the project first')
    }
    return session
  }

Comment thread src/main/broker.ts Outdated
Comment on lines 1883 to 1920
      const sessions = projectId ? [this.getSessionForProject(projectId)] : Array.from(this.sessions.values())
      const results = await Promise.all(
        sessions.map(async (session) => {
-         const agents = await session.client.listAgents()
-         for (const agent of agents) {
-           this.rememberAgentProject(agent.name, session.projectId)
+         try {
+           return await this.collectSessionAgents(session)
+         } catch (err) {
+           // A dead broker (connection refused) is a definitive signal; respawn
+           // immediately. A wedged broker (request timeout) might just be a slow
+           // response, so only respawn after MAX consecutive timeouts; below the
+           // threshold we rethrow so the renderer keeps its stale agent list
+           // rather than flickering to empty on a transient blip.
+           const unreachable = isBrokerUnreachableError(err)
+           if (!unreachable) {
+             if (!isBrokerTimeoutError(err)) throw err
+             const timeouts = (this.brokerTimeoutCounts.get(session.projectId) ?? 0) + 1
+             this.brokerTimeoutCounts.set(session.projectId, timeouts)
+             if (timeouts < MAX_BROKER_TIMEOUTS_BEFORE_REVIVE) throw err
+             console.warn(
+               `[broker] listAgents: broker for project ${session.projectId} timed out ${timeouts}x; ` +
+               `restarting it on a fresh port`
+             )
+           }
+           // Restart on a fresh port and retry once against the new session; if
+           // recovery fails, degrade to an empty list for this project rather
+           // than failing the whole call (other projects may still be healthy).
+           this.brokerTimeoutCounts.delete(session.projectId)
+           const revived = await this.reviveSession(session.projectId)
+           const next = revived ? this.sessions.get(session.projectId) : undefined
+           if (!next) {
+             console.warn(`[broker] listAgents: broker for project ${session.projectId} is unreachable; returning no agents`)
+             return []
+           }
+           return this.collectSessionAgents(next)
          }
-         return Promise.all(
-           agents.map(async (agent) => {
-             const inboundDeliveryMode = await session.client.getInboundDeliveryMode(agent.name).catch(() => undefined)
-             return { ...agent, projectId: session.projectId, inboundDeliveryMode }
-           })
-         )
        })
      )
      return results.flat()
    }

high

Update listAgents to use the new getOrAwaitSession helper. This ensures that any scoped or unscoped listAgents calls made while a background revive or start is in progress will gracefully await the operation's completion instead of throwing transient errors or silently omitting the project.

    let sessions: BrokerSession[]
    if (projectId) {
      try {
        const session = await this.getOrAwaitSession(projectId)
        sessions = [session]
      } catch (err) {
        throw err
      }
    } else {
      const activeProjectIds = Array.from(new Set([
        ...this.sessions.keys(),
        ...this.revivePromises.keys(),
        ...this.startPromises.keys()
      ]))
      sessions = (await Promise.all(
        activeProjectIds.map((id) => this.getOrAwaitSession(id).catch(() => undefined))
      )).filter((s): s is BrokerSession => !!s)
    }
    const results = await Promise.all(
      sessions.map(async (session) => {
        try {
          return await this.collectSessionAgents(session)
        } catch (err) {
          // A dead broker (connection refused) is a definitive signal — respawn
          // immediately. A wedged broker (request timeout) might just be a slow
          // response, so only respawn after MAX consecutive timeouts; below the
          // threshold we rethrow so the renderer keeps its stale agent list
          // rather than flickering to empty on a transient blip.
          const unreachable = isBrokerUnreachableError(err)
          if (!unreachable) {
            if (!isBrokerTimeoutError(err)) throw err
            const timeouts = (this.brokerTimeoutCounts.get(session.projectId) ?? 0) + 1
            this.brokerTimeoutCounts.set(session.projectId, timeouts)
            if (timeouts < MAX_BROKER_TIMEOUTS_BEFORE_REVIVE) throw err
            console.warn(
              `[broker] listAgents: broker for project ${session.projectId} timed out ${timeouts}x; ` +
              `restarting it on a fresh port`
            )
          }
          // Restart on a fresh port and retry once against the new session; if
          // recovery fails, degrade to an empty list for this project rather
          // than failing the whole call (other projects may still be healthy).
          this.brokerTimeoutCounts.delete(session.projectId)
          const revived = await this.reviveSession(session.projectId)
          const next = revived ? this.sessions.get(session.projectId) : undefined
          if (!next) {
            console.warn(`[broker] listAgents: broker for project ${session.projectId} is unreachable; returning no agents`)
            return []
          }
          return this.collectSessionAgents(next)
        }
      })
    )
    return results.flat()
  }

Comment thread src/main/broker.ts Outdated
Comment on lines 2059 to 2065
const targetProjectIds = projectId ? [projectId] : Array.from(this.sessions.keys())
for (const targetProjectId of targetProjectIds) {
this.closeInputStreamsForProject(targetProjectId)
this.brokerTimeoutCounts.delete(targetProjectId)

const session = this.sessions.get(targetProjectId)
if (!session) continue

medium

Delete the active revive promise from this.revivePromises during shutdown to ensure that any active background revive is cancelled and does not resurrect the broker session after a manual shutdown has been explicitly requested.

Suggested change
- const targetProjectIds = projectId ? [projectId] : Array.from(this.sessions.keys())
- for (const targetProjectId of targetProjectIds) {
-   this.closeInputStreamsForProject(targetProjectId)
-   this.brokerTimeoutCounts.delete(targetProjectId)
-   const session = this.sessions.get(targetProjectId)
-   if (!session) continue
+ const targetProjectIds = projectId ? [projectId] : Array.from(this.sessions.keys())
+ for (const targetProjectId of targetProjectIds) {
+   this.closeInputStreamsForProject(targetProjectId)
+   this.brokerTimeoutCounts.delete(targetProjectId)
+   this.revivePromises.delete(targetProjectId)
+   const session = this.sessions.get(targetProjectId)
+   if (!session) continue

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

Comment thread src/main/broker.ts Outdated
console.warn(`[broker] listAgents: broker for project ${session.projectId} is unreachable; returning no agents`)
return []
}
return this.collectSessionAgents(next)

🟡 Unhandled throw from post-revive collectSessionAgents breaks Promise.all for all projects

The comment at line 1905-1907 states the intent: "if recovery fails, degrade to an empty list for this project rather than failing the whole call (other projects may still be healthy)." However, this.collectSessionAgents(next) at line 1915 is not wrapped in a try/catch. If the newly spawned broker is also immediately unresponsive (e.g. port conflict, binary crash), this call throws from inside the catch block, propagating through Promise.all at line 1884 and rejecting the entire listAgents call for ALL sessions. In a multi-project setup, one persistently-failing project prevents agent list updates for all healthy projects on that poll cycle — contradicting the stated graceful-degradation intent.

Suggested change
- return this.collectSessionAgents(next)
+ try {
+   return await this.collectSessionAgents(next)
+ } catch {
+   return []
+ }

@agent-relay-code
Contributor

pr-reviewer applied fixes — committed and pushed c7837eb to this PR. The notes below describe what changed.

Reviewed PR #67 and fixed the broker recovery path in src/main/broker.ts.

Changes made:

  • Added best-effort owned broker process termination before revive, so a wedged local broker cannot keep the runtime lock and block replacement startup.
  • Changed revive teardown to disconnect/drop the stale session instead of waiting on a broker HTTP shutdown that may itself be wedged.
  • Kept cloud sandbox sessions out of local respawn recovery so remote broker failures are surfaced instead of silently becoming an empty agent list.
  • Factored session cleanup through dropSession() so shutdown and revive clear the same local maps consistently.
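
As a rough illustration of best-effort termination, the sketch below probes a PID with signal 0 and escalates from SIGTERM to SIGKILL after a grace period. The function names are hypothetical, and the escalation order is an assumption based on the name BROKER_REVIVE_TERM_GRACE_MS, not a copy of terminateOwnedBrokerProcess; only `process.kill` is a real Node.js API here.

```typescript
// Probe a PID with signal 0, which checks for existence without delivering a signal.
function isProcessAliveSketch(pid: number): boolean {
  try {
    process.kill(pid, 0)
    return true
  } catch (err) {
    // EPERM means the process exists but we lack permission to signal it.
    return (err as { code?: string }).code === 'EPERM'
  }
}

// Poll until the process is gone or the grace period runs out.
async function waitForExitSketch(pid: number, graceMs: number, pollMs = 50): Promise<boolean> {
  const deadline = Date.now() + graceMs
  while (Date.now() < deadline) {
    if (!isProcessAliveSketch(pid)) return true
    await new Promise((resolve) => setTimeout(resolve, pollMs))
  }
  return !isProcessAliveSketch(pid)
}

// Best-effort: ask politely first, then force the exit if the process ignores it.
async function terminateSketch(pid: number, graceMs: number): Promise<void> {
  if (!isProcessAliveSketch(pid)) return
  try {
    process.kill(pid, 'SIGTERM')
  } catch {
    return // the process went away between the probe and the signal
  }
  if (await waitForExitSketch(pid, graceMs)) return
  try {
    process.kill(pid, 'SIGKILL')
  } catch {
    // already gone
  }
}

console.log(isProcessAliveSketch(process.pid)) // true: the current process is running
```

Every signal call is wrapped so that a process that has already exited never turns the cleanup into an error, which is the point of "best-effort" here.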

Checks run:

  • node --experimental-strip-types --check src/main/broker.ts
  • npm test
  • npm run build

Note: direct npx tsc --noEmit -p tsconfig.node.json still reports existing unrelated repo type errors outside this PR path; the actual build completed successfully.

@agent-relay-code
Contributor

pr-reviewer applied fixes — committed and pushed ba16852 to this PR. The notes below describe what changed.

Fixed the PR review findings in src/main/broker.ts:

  • Narrowed broker timeout detection so ordinary AbortError / manual aborts do not count toward broker revive.
  • Added session awaiting during listAgents so active start / revive operations do not cause transient missing-session errors.
  • Prevented manual shutdown from being undone by an in-flight revive.
  • Wrapped the post-revive collectSessionAgents retry so one still-bad project returns [] instead of rejecting all projects.
  • Avoided iterator spreads that break this repo’s current Node TS target.

Checks run:

  • node --experimental-strip-types --check src/main/broker.ts passed
  • npm test passed
  • npm run build passed
  • npx tsc --noEmit -p tsconfig.node.json still fails on existing unrelated repo errors, but no broker.ts errors remain

@khaliqgant khaliqgant merged commit c9c5dde into main May 31, 2026
1 check passed
khaliqgant added a commit that referenced this pull request Jun 3, 2026
* fix(broker): generalize wedge recovery to all polled reads

PR #67 added self-healing to listAgents, but the broker can wedge a
single HTTP endpoint while others stay live — `/api/pending` hangs while
`/api/spawned` keeps answering (the broker's relaycast event loop runs
fine the whole time). So listAgents never notices, never respawns, and
getPending times out on every 2.5s poll forever, flooding the main log
with "Error occurred in handler for 'broker:get-pending'".

- Extract the listAgents recovery into a reusable withWedgeRecovery()
  helper (refused → respawn now; repeated timeouts → respawn after MAX;
  unrecoverable → degrade to a fallback instead of rejecting). listAgents
  now delegates to it.
- Route getPendingMessages through it with degradeOnTimeout: true (an
  empty held-message list is harmless, so a timeout returns [] rather
  than rejecting and logging) and await getOrAwaitSession so a poll
  racing a respawn doesn't throw "workspace not started".
- Renderer: guard the 2.5s getPending poll with an in-flight ref. A
  wedged broker makes each call hang for its full 30s timeout; without
  the guard the interval stacks ~12 concurrent calls that all time out
  at once — which is why the log filled with hundreds of identical lines.
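
The in-flight guard can be sketched without any UI framework. The renderer reportedly uses a ref; `guardInFlight` below is a hypothetical closure-based equivalent that shows the same idea: skip a tick while the previous call is still pending.

```typescript
// Wrap a task so overlapping invocations are skipped instead of stacked.
function guardInFlight<T>(task: () => Promise<T>): () => Promise<T | undefined> {
  let inFlight = false
  return async () => {
    if (inFlight) return undefined // previous poll has not settled; do not stack another
    inFlight = true
    try {
      return await task()
    } finally {
      inFlight = false
    }
  }
}

// Without the guard, a 30s timeout polled every 2.5s stacks 30_000 / 2_500 = 12 calls.
let calls = 0
const neverSettles = (): Promise<never> => {
  calls += 1
  return new Promise<never>(() => {}) // simulates a wedged endpoint
}
const guardedPoll = guardInFlight(neverSettles)

void guardedPoll()
void guardedPoll()
void guardedPoll()
console.log(calls) // 1
```

With the guard in place, a wedged endpoint costs one pending request per timeout window rather than a dozen that all fail together.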

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: apply pr-reviewer fixes for #68

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: agent-relay-bot[bot] <agent-relay-bot[bot]@users.noreply.github.com>