Skip to content

Broker reliability: half-start recovery + surfaced read failures + doctor diagnostics#916

Merged
khaliqgant merged 15 commits into
mainfrom
consolidated/broker-reliability
May 20, 2026
Merged

Broker reliability: half-start recovery + surfaced read failures + doctor diagnostics#916
khaliqgant merged 15 commits into
mainfrom
consolidated/broker-reliability

Conversation

@khaliqgant
Copy link
Copy Markdown
Member

Summary

Three core broker-reliability fixes from real orchestration friction, consolidated and reviewed together. Stacked on PR #914 (combined/reliability-and-logs) because the doctor regression depends on #914's doctor work — the review delta here is only the 3 fixes. Merge #914 first; GitHub will retarget this base to main.

Fixes

  • Half-started broker recovery (fa4617c9): agent-relay up detects a half-started/orphaned broker (PID alive but no readable connection metadata / readiness timed out) and recovers — reap+restart or exit non-zero with actionable guidance — instead of silently leaving a live-but-useless broker. agent-relay down --force also reaps agent-relay up --foreground wrappers.
  • Surfaced broker read failures (5b27e066): who/agents no longer mask broker/session query failures as an empty list (removed silent .catch(() => [])) — they now print a clear error and exit non-zero. history/replies resolve the project broker session even when AGENT_RELAY_STATE_DIR points elsewhere (fixes "Failed to query broker session: typo in url or port?" on a healthy broker).
  • Doctor diagnostics (d186133c): agent-relay doctor flags stale-connection, unresolved ${RELAY_*} API-key template, and half-started/orphaned broker as distinct actionable failures instead of "healthy"; deterministic repros documented in docs/doctor-orchestration-repros.md.

These resolve three friction reports where the orchestration read surface was unusable on an apparently-healthy broker.

Review & validation

Independently reviewed (Claude): VERDICT GO — confirmed each fix genuine, the hand-resolved agent-management-listing.ts merge combines #914's enriched who fields and the new error-surfacing with zero loss, no regression of #914/#892#895/agents:logs. Validated on the merged branch: typecheck clean, 237 CLI + 631 broker + 54 SDK tests, build green.

Non-blocking follow-ups noted by review: commit d186133c is label-only-misleading (test(doctor): but includes prod doctor logic); isForegroundBrokerCliCommand regex is conservatively AND-gated (negligible risk).

🤖 Generated with Claude Code

Proactive Runtime Bot and others added 14 commits May 19, 2026 10:38
…-logs

# Conflicts:
#	.trajectories/index.json
…r-reliability

# Conflicts:
#	.trajectories/index.json
#	CHANGELOG.md
…ted/broker-reliability

# Conflicts:
#	.trajectories/index.json
#	CHANGELOG.md
#	src/cli/lib/agent-management-listing.ts
Cherry-picks the doctor diagnostic regression from
test/doctor-stale-broker-repro (a0ee79bc) without the trajectory churn:
docs/doctor-orchestration-repros.md plus doctor.ts/doctor.test.ts
assertions that `agent-relay doctor` flags stale connection, unresolved
API-key template, and half-started/orphaned broker states.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed-key brokers

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@khaliqgant khaliqgant requested a review from willwashburn as a code owner May 19, 2026 13:29
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR hardens broker session resolution and lifecycle recovery in the Agent Relay CLI. It introduces project-scoped broker connection metadata, eliminates silent agent-listing fallbacks, adds deterministic recovery for half-started brokers, and enhances diagnostics to detect stale or unresolved broker states.

Changes

Broker Reliability and Session Resolution

Layer / File(s) Summary
Project-scoped broker connection foundation
src/cli/lib/project-broker-client.ts, src/cli/lib/project-broker-client.test.ts
New module computes .agent-relay/connection.json path relative to project root and provides connectProjectBrokerClient(projectRoot) helper to isolate broker session resolution from global AGENT_RELAY_STATE_DIR environment variable.
Agent listing error reporting
src/cli/lib/agent-management-listing.ts, src/cli/lib/agent-management-listing.test.ts, src/cli/commands/agent-management.ts, src/cli/commands/agent-management.test.ts
New formatListAgentsError helper produces consistent terminal-safe error messages; runAgentsCommand and runWhoCommand now log formatted errors and exit with code 1 instead of silently falling back to empty agent lists; agents:kill command fails immediately on listing errors instead of continuing with empty list.
Broker connection initialization updates
src/cli/commands/agent-management.ts, src/cli/commands/messaging.ts
All broker "connect existing" call sites switched from AgentRelayClient.connect({ cwd }) to connectProjectBrokerClient(cwd) to use project-local connection metadata across createSdkClient, createDefaultClient, and resolveRelaycastApiKey entry points.
Broker half-start recovery and orphan cleanup
src/cli/lib/broker-lifecycle.ts, src/cli/commands/core.test.ts
New recoverHalfStartedBroker detects and recovers from detached brokers that fail to become API-ready; terminateProcess helper escalates from SIGTERM to SIGKILL when force enabled; killOrphanedBrokerProcesses now matches both binary and foreground CLI orphans and returns matched/killed counts; runUpCommand gates background startup on recovery success and performs forced cleanup on readiness timeout.
Doctor broker diagnostics enhancement
src/cli/lib/doctor.ts, src/cli/commands/doctor.test.ts
Enhanced checkBrokerReliability reads and validates broker connection metadata from .agent-relay/connection.json, detects stale/unreachable broker PIDs via ps, checks for unresolved RELAY_API_KEY templates, and produces failure results with specific remediation guidance for missing, invalid, or stale broker metadata states.
Documentation and trajectory records
CHANGELOG.md, docs/doctor-orchestration-repros.md, .trajectories/completed/2026-05/*
CHANGELOG extended with four new Unreleased fixes; docs/doctor-orchestration-repros.md documents three deterministic reproduction scenarios (stale broker connection, unresolved API key template, half-started broker with missing metadata); completed trajectory files record broker session read path fixes and half-start recovery implementation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • AgentWorkforce/relay#855: Both PRs modify broker lifecycle orchestration in src/cli/lib/broker-lifecycle.ts, including detached headless broker startup classification, readiness polling, orphan cleanup behavior, and runUpCommand/runStatusCommand termination logic.
  • AgentWorkforce/relay#846: Both PRs update src/cli/commands/agent-management.ts's broker connection initialization in createSdkClient to resolve the correct project-scoped broker session.

Suggested reviewers

  • willwashburn
  • barryollama

Poem

🐰 A rabbit hops through broker states,

Recovers half-starts, no longer waits,

Project paths guide the connection way,

Listing errors sing, no silent fray,

Doctor diagnoses what's gone astray! 🏥

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the three main changes: broker half-start recovery, surfaced read failures, and doctor diagnostics improvements.
Description check ✅ Passed The description includes a clear summary of the three core fixes, a test plan checkbox section, and thorough context about dependencies and validation, meeting the template requirements.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch consolidated/broker-reliability

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Base automatically changed from combined/reliability-and-logs to main May 19, 2026 13:41
willwashburn
willwashburn previously approved these changes May 19, 2026
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 8 additional findings.

Open in Devin Review

…eliability

# Conflicts:
#	.trajectories/index.json
#	CHANGELOG.md
#	crates/broker/src/runtime/worker_events.rs
#	src/cli/commands/agent-management.ts
#	src/cli/commands/doctor.test.ts
#	src/cli/lib/agent-management-listing.ts
#	src/cli/lib/doctor.ts
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cli/commands/agent-management.ts (1)

109-125: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use project-scoped connection resolution during autostart polling

createSdkClient switched to project-scoped connect, but waitForBrokerClient still uses AgentRelayClient.connect({ cwd }) (Line 203). If AGENT_RELAY_STATE_DIR points elsewhere, autostart can poll the wrong broker/session and fail or misroute.

Suggested fix
 async function waitForBrokerClient(
   cwd: string,
   timeoutMs = 15_000,
   intervalMs = 250
 ): Promise<AgentRelayClient> {
@@
   while (Date.now() < deadline) {
     try {
-      const client = AgentRelayClient.connect({ cwd });
+      const client = connectProjectBrokerClient(cwd);
       await waitForReadyBrokerClient(client, Math.max(1, deadline - Date.now()), intervalMs);
       return client;
     } catch (err) {
       lastError = err;
       await new Promise((resolve) => setTimeout(resolve, intervalMs));

Also applies to: 193-205

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cli/commands/agent-management.ts` around lines 109 - 125, createSdkClient
currently uses connectProjectBrokerClient(cwd) but when auto-starting it falls
back to waitForBrokerClient which still calls AgentRelayClient.connect({ cwd })
and can therefore poll/connect to the wrong broker when AGENT_RELAY_STATE_DIR is
overridden; update waitForBrokerClient (and any other autostart polling helpers
called from createSdkClient/startBackgroundBroker) to use the same
project-scoped connection resolution (i.e., call connectProjectBrokerClient(cwd)
or pass the resolved project-specific socket/path) instead of
AgentRelayClient.connect({ cwd }) so the autostart polls the correct
broker/session; ensure symbols touched include waitForBrokerClient,
startBackgroundBroker, createSdkClient, connectProjectBrokerClient and replace
or augment any AgentRelayClient.connect usage accordingly.
🧹 Nitpick comments (1)
src/cli/lib/doctor.ts (1)

103-133: 💤 Low value

Potential false positive when matching broker processes by brokerName.

The findLiveBrokerProcesses function matches processes containing --name ${brokerName} where brokerName is derived from path.basename(projectRoot). If the project folder has a common name like app, test, or project, it could match unrelated broker processes from other projects with the same folder name.

This is mitigated by also requiring the command to contain agent-relay-broker, but projects with identical folder names running brokers would still match incorrectly.

Consider this acceptable for diagnostics (false positives are better than false negatives for detecting half-started brokers), but document this limitation if issues arise.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cli/lib/doctor.ts` around lines 103 - 133, The current
findLiveBrokerProcesses logic can false-positive on projects sharing a common
folder name because it checks process.command.includes(`--name ${brokerName}`);
update the matching to only accept an exact --name argument rather than a
substring: parse the process.command for tokenized args (in
findLiveBrokerProcesses or a helper) and verify there is a --name flag whose
value equals brokerName (or equals brokerName when wrapped in quotes), or use a
stricter regex that enforces word boundaries around the --name value; reference
the functions findLiveBrokerProcesses, parsePsLine and the brokerName variable
when making this change so matching is tightened while keeping the existing
checks for agent-relay-broker, resolvedProjectRoot and resolvedDataDir.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.trajectories/completed/2026-05/traj_2gpglosdsq7s.json:
- Line 47: The trajectory metadata contains a user-specific absolute path in the
projectId field which can leak usernames; update the projectId entry (the
"projectId" key in the JSON) to use a repo-relative or sanitized identifier
(e.g., the repository name or a canonical ID) instead of the full local path, or
derive it from an environment/config value so local machine paths are never
committed; ensure any commit replaces occurrences of the absolute path with the
sanitized value and add a note or validation to prevent future commits of
absolute user paths.

In `@docs/doctor-orchestration-repros.md`:
- Line 12: Replace the hardcoded absolute CLI path
"CLI=/Users/khaliqgant/Projects/AgentWorkforce/relay/dist/src/cli/index.js" (and
the duplicate at line 110) with a portable reference; for example use a relative
path like "CLI=./dist/src/cli/index.js", a workspace variable like
"CLI=$PWD/dist/src/cli/index.js", or an npm script invocation (e.g., "npm run
cli") so the repro steps work for other contributors and CI.
- Line 21: The repro command references undefined shell variables $CONN1 and
$CONN2, making the script non-runnable; fix by either (a) adding explicit
variable definitions (e.g., export CONN1="path/to/conn1.json" and export
CONN2="path/to/conn2.json") before the node -e invocation, or (b) modify the
command to accept positional parameters (replace "$CONN1" "$CONN2" with "$1"
"$2" and document how to call it), and update the docs to show an example pair
of concrete file paths so node -e '... "$CONN1" "$CONN2"' can run
deterministically.

---

Outside diff comments:
In `@src/cli/commands/agent-management.ts`:
- Around line 109-125: createSdkClient currently uses
connectProjectBrokerClient(cwd) but when auto-starting it falls back to
waitForBrokerClient which still calls AgentRelayClient.connect({ cwd }) and can
therefore poll/connect to the wrong broker when AGENT_RELAY_STATE_DIR is
overridden; update waitForBrokerClient (and any other autostart polling helpers
called from createSdkClient/startBackgroundBroker) to use the same
project-scoped connection resolution (i.e., call connectProjectBrokerClient(cwd)
or pass the resolved project-specific socket/path) instead of
AgentRelayClient.connect({ cwd }) so the autostart polls the correct
broker/session; ensure symbols touched include waitForBrokerClient,
startBackgroundBroker, createSdkClient, connectProjectBrokerClient and replace
or augment any AgentRelayClient.connect usage accordingly.

---

Nitpick comments:
In `@src/cli/lib/doctor.ts`:
- Around line 103-133: The current findLiveBrokerProcesses logic can
false-positive on projects sharing a common folder name because it checks
process.command.includes(`--name ${brokerName}`); update the matching to only
accept an exact --name argument rather than a substring: parse the
process.command for tokenized args (in findLiveBrokerProcesses or a helper) and
verify there is a --name flag whose value equals brokerName (or equals
brokerName when wrapped in quotes), or use a stricter regex that enforces word
boundaries around the --name value; reference the functions
findLiveBrokerProcesses, parsePsLine and the brokerName variable when making
this change so matching is tightened while keeping the existing checks for
agent-relay-broker, resolvedProjectRoot and resolvedDataDir.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 25b61ec2-3855-4370-aede-339e367d3100

📥 Commits

Reviewing files that changed from the base of the PR and between 9c1849b and ed4241e.

📒 Files selected for processing (19)
  • .trajectories/compacted/release-6.2.3.json
  • .trajectories/compacted/release-6.2.3.md
  • .trajectories/completed/2026-05/traj_2gpglosdsq7s.json
  • .trajectories/completed/2026-05/traj_2gpglosdsq7s.md
  • .trajectories/completed/2026-05/traj_gnqvtoxtc8dy.json
  • .trajectories/completed/2026-05/traj_gnqvtoxtc8dy.md
  • CHANGELOG.md
  • docs/doctor-orchestration-repros.md
  • src/cli/commands/agent-management.test.ts
  • src/cli/commands/agent-management.ts
  • src/cli/commands/core.test.ts
  • src/cli/commands/doctor.test.ts
  • src/cli/commands/messaging.ts
  • src/cli/lib/agent-management-listing.test.ts
  • src/cli/lib/agent-management-listing.ts
  • src/cli/lib/broker-lifecycle.ts
  • src/cli/lib/doctor.ts
  • src/cli/lib/project-broker-client.test.ts
  • src/cli/lib/project-broker-client.ts

},
"commits": [],
"filesChanged": [],
"projectId": "/Users/khaliqgant/Projects/AgentWorkforce/relay-worktrees/broker-session-query-and-listagents",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid committing user-specific absolute paths in trajectory metadata.

projectId embeds a local username/path, which can leak user identifiers and machine layout in repository history. Prefer repo-relative or sanitized identifiers.

Suggested doc/data fix
-  "projectId": "/Users/khaliqgant/Projects/AgentWorkforce/relay-worktrees/broker-session-query-and-listagents",
+  "projectId": "relay-worktrees/broker-session-query-and-listagents",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"projectId": "/Users/khaliqgant/Projects/AgentWorkforce/relay-worktrees/broker-session-query-and-listagents",
"projectId": "relay-worktrees/broker-session-query-and-listagents",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.trajectories/completed/2026-05/traj_2gpglosdsq7s.json at line 47, The
trajectory metadata contains a user-specific absolute path in the projectId
field which can leak usernames; update the projectId entry (the "projectId" key
in the JSON) to use a repo-relative or sanitized identifier (e.g., the
repository name or a canonical ID) instead of the full local path, or derive it
from an environment/config value so local machine paths are never committed;
ensure any commit replaces occurrences of the absolute path with the sanitized
value and add a note or validation to prevent future commits of absolute user
paths.

Run from a temporary project with the built CLI:

```bash
CLI=/Users/khaliqgant/Projects/AgentWorkforce/relay/dist/src/cli/index.js
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use a portable CLI path in repro steps.

Hardcoded /Users/.../dist/... paths make this unusable for other contributors and CI doc validation.

Suggested doc fix
-CLI=/Users/khaliqgant/Projects/AgentWorkforce/relay/dist/src/cli/index.js
+CLI="$(pwd)/dist/src/cli/index.js"

Also applies to: 110-110

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/doctor-orchestration-repros.md` at line 12, Replace the hardcoded
absolute CLI path
"CLI=/Users/khaliqgant/Projects/AgentWorkforce/relay/dist/src/cli/index.js" (and
the duplicate at line 110) with a portable reference; for example use a relative
path like "CLI=./dist/src/cli/index.js", a workspace variable like
"CLI=$PWD/dist/src/cli/index.js", or an npm script invocation (e.g., "npm run
cli") so the repro steps work for other contributors and CI.

node "$CLI" who --json
kill -9 11078
node "$CLI" up --no-dashboard --port 49300
node -e 'const fs=require("fs"); const old=JSON.parse(process.argv[1]); const next=JSON.parse(process.argv[2]); old.pid=next.pid; fs.writeFileSync(".agent-relay/connection.json", JSON.stringify(old,null,2));' "$CONN1" "$CONN2"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Undefined CONN1/CONN2 makes the repro non-runnable.

This command references $CONN1 and $CONN2, but they are never set in the snippet, so the “deterministic” repro will fail as written.

Suggested doc fix
 node "$CLI" up --no-dashboard --port 49200
-cat .agent-relay/connection.json
+CONN1="$(cat .agent-relay/connection.json)"
 ...
 node "$CLI" up --no-dashboard --port 49300
+CONN2="$(cat .agent-relay/connection.json)"
 node -e 'const fs=require("fs"); const old=JSON.parse(process.argv[1]); const next=JSON.parse(process.argv[2]); old.pid=next.pid; fs.writeFileSync(".agent-relay/connection.json", JSON.stringify(old,null,2));' "$CONN1" "$CONN2"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/doctor-orchestration-repros.md` at line 21, The repro command references
undefined shell variables $CONN1 and $CONN2, making the script non-runnable; fix
by either (a) adding explicit variable definitions (e.g., export
CONN1="path/to/conn1.json" and export CONN2="path/to/conn2.json") before the
node -e invocation, or (b) modify the command to accept positional parameters
(replace "$CONN1" "$CONN2" with "$1" "$2" and document how to call it), and
update the docs to show an example pair of concrete file paths so node -e '...
"$CONN1" "$CONN2"' can run deterministically.

@khaliqgant khaliqgant merged commit 817401d into main May 20, 2026
51 of 52 checks passed
@khaliqgant khaliqgant deleted the consolidated/broker-reliability branch May 20, 2026 07:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants