Broker reliability: half-start recovery + surfaced read failures + doctor diagnostics#916
Conversation
…-logs # Conflicts: # .trajectories/index.json
…r-reliability # Conflicts: # .trajectories/index.json # CHANGELOG.md
…ted/broker-reliability # Conflicts: # .trajectories/index.json # CHANGELOG.md # src/cli/lib/agent-management-listing.ts
Cherry-picks the doctor diagnostic regression from test/doctor-stale-broker-repro (a0ee79bc) without the trajectory churn: docs/doctor-orchestration-repros.md plus doctor.ts/doctor.test.ts assertions that `agent-relay doctor` flags stale connection, unresolved API-key template, and half-started/orphaned broker states. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed-key brokers Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR hardens broker session resolution and lifecycle recovery in the Agent Relay CLI. It introduces project-scoped broker connection metadata, eliminates silent agent-listing fallbacks, adds deterministic recovery for half-started brokers, and enhances diagnostics to detect stale or unresolved broker states. ChangesBroker Reliability and Session Resolution
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…eliability # Conflicts: # .trajectories/index.json # CHANGELOG.md # crates/broker/src/runtime/worker_events.rs # src/cli/commands/agent-management.ts # src/cli/commands/doctor.test.ts # src/cli/lib/agent-management-listing.ts # src/cli/lib/doctor.ts
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cli/commands/agent-management.ts (1)
109-125:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse project-scoped connection resolution during autostart polling
createSdkClientswitched to project-scoped connect, butwaitForBrokerClientstill usesAgentRelayClient.connect({ cwd })(Line 203). IfAGENT_RELAY_STATE_DIRpoints elsewhere, autostart can poll the wrong broker/session and fail or misroute.Suggested fix
async function waitForBrokerClient( cwd: string, timeoutMs = 15_000, intervalMs = 250 ): Promise<AgentRelayClient> { @@ while (Date.now() < deadline) { try { - const client = AgentRelayClient.connect({ cwd }); + const client = connectProjectBrokerClient(cwd); await waitForReadyBrokerClient(client, Math.max(1, deadline - Date.now()), intervalMs); return client; } catch (err) { lastError = err; await new Promise((resolve) => setTimeout(resolve, intervalMs));Also applies to: 193-205
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cli/commands/agent-management.ts` around lines 109 - 125, createSdkClient currently uses connectProjectBrokerClient(cwd) but when auto-starting it falls back to waitForBrokerClient which still calls AgentRelayClient.connect({ cwd }) and can therefore poll/connect to the wrong broker when AGENT_RELAY_STATE_DIR is overridden; update waitForBrokerClient (and any other autostart polling helpers called from createSdkClient/startBackgroundBroker) to use the same project-scoped connection resolution (i.e., call connectProjectBrokerClient(cwd) or pass the resolved project-specific socket/path) instead of AgentRelayClient.connect({ cwd }) so the autostart polls the correct broker/session; ensure symbols touched include waitForBrokerClient, startBackgroundBroker, createSdkClient, connectProjectBrokerClient and replace or augment any AgentRelayClient.connect usage accordingly.
🧹 Nitpick comments (1)
src/cli/lib/doctor.ts (1)
103-133: 💤 Low valuePotential false positive when matching broker processes by
brokerName.The
findLiveBrokerProcessesfunction matches processes containing--name ${brokerName}wherebrokerNameis derived frompath.basename(projectRoot). If the project folder has a common name likeapp,test, orproject, it could match unrelated broker processes from other projects with the same folder name.This is mitigated by also requiring the command to contain
agent-relay-broker, but projects with identical folder names running brokers would still match incorrectly.Consider this acceptable for diagnostics (false positives are better than false negatives for detecting half-started brokers), but document this limitation if issues arise.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cli/lib/doctor.ts` around lines 103 - 133, The current findLiveBrokerProcesses logic can false-positive on projects sharing a common folder name because it checks process.command.includes(`--name ${brokerName}`); update the matching to only accept an exact --name argument rather than a substring: parse the process.command for tokenized args (in findLiveBrokerProcesses or a helper) and verify there is a --name flag whose value equals brokerName (or equals brokerName when wrapped in quotes), or use a stricter regex that enforces word boundaries around the --name value; reference the functions findLiveBrokerProcesses, parsePsLine and the brokerName variable when making this change so matching is tightened while keeping the existing checks for agent-relay-broker, resolvedProjectRoot and resolvedDataDir.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.trajectories/completed/2026-05/traj_2gpglosdsq7s.json:
- Line 47: The trajectory metadata contains a user-specific absolute path in the
projectId field which can leak usernames; update the projectId entry (the
"projectId" key in the JSON) to use a repo-relative or sanitized identifier
(e.g., the repository name or a canonical ID) instead of the full local path, or
derive it from an environment/config value so local machine paths are never
committed; ensure any commit replaces occurrences of the absolute path with the
sanitized value and add a note or validation to prevent future commits of
absolute user paths.
In `@docs/doctor-orchestration-repros.md`:
- Line 12: Replace the hardcoded absolute CLI path
"CLI=/Users/khaliqgant/Projects/AgentWorkforce/relay/dist/src/cli/index.js" (and
the duplicate at line 110) with a portable reference; for example use a relative
path like "CLI=./dist/src/cli/index.js", a workspace variable like
"CLI=$PWD/dist/src/cli/index.js", or an npm script invocation (e.g., "npm run
cli") so the repro steps work for other contributors and CI.
- Line 21: The repro command references undefined shell variables $CONN1 and
$CONN2, making the script non-runnable; fix by either (a) adding explicit
variable definitions (e.g., export CONN1="path/to/conn1.json" and export
CONN2="path/to/conn2.json") before the node -e invocation, or (b) modify the
command to accept positional parameters (replace "$CONN1" "$CONN2" with "$1"
"$2" and document how to call it), and update the docs to show an example pair
of concrete file paths so node -e '... "$CONN1" "$CONN2"' can run
deterministically.
---
Outside diff comments:
In `@src/cli/commands/agent-management.ts`:
- Around line 109-125: createSdkClient currently uses
connectProjectBrokerClient(cwd) but when auto-starting it falls back to
waitForBrokerClient which still calls AgentRelayClient.connect({ cwd }) and can
therefore poll/connect to the wrong broker when AGENT_RELAY_STATE_DIR is
overridden; update waitForBrokerClient (and any other autostart polling helpers
called from createSdkClient/startBackgroundBroker) to use the same
project-scoped connection resolution (i.e., call connectProjectBrokerClient(cwd)
or pass the resolved project-specific socket/path) instead of
AgentRelayClient.connect({ cwd }) so the autostart polls the correct
broker/session; ensure symbols touched include waitForBrokerClient,
startBackgroundBroker, createSdkClient, connectProjectBrokerClient and replace
or augment any AgentRelayClient.connect usage accordingly.
---
Nitpick comments:
In `@src/cli/lib/doctor.ts`:
- Around line 103-133: The current findLiveBrokerProcesses logic can
false-positive on projects sharing a common folder name because it checks
process.command.includes(`--name ${brokerName}`); update the matching to only
accept an exact --name argument rather than a substring: parse the
process.command for tokenized args (in findLiveBrokerProcesses or a helper) and
verify there is a --name flag whose value equals brokerName (or equals
brokerName when wrapped in quotes), or use a stricter regex that enforces word
boundaries around the --name value; reference the functions
findLiveBrokerProcesses, parsePsLine and the brokerName variable when making
this change so matching is tightened while keeping the existing checks for
agent-relay-broker, resolvedProjectRoot and resolvedDataDir.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 25b61ec2-3855-4370-aede-339e367d3100
📒 Files selected for processing (19)
.trajectories/compacted/release-6.2.3.json.trajectories/compacted/release-6.2.3.md.trajectories/completed/2026-05/traj_2gpglosdsq7s.json.trajectories/completed/2026-05/traj_2gpglosdsq7s.md.trajectories/completed/2026-05/traj_gnqvtoxtc8dy.json.trajectories/completed/2026-05/traj_gnqvtoxtc8dy.mdCHANGELOG.mddocs/doctor-orchestration-repros.mdsrc/cli/commands/agent-management.test.tssrc/cli/commands/agent-management.tssrc/cli/commands/core.test.tssrc/cli/commands/doctor.test.tssrc/cli/commands/messaging.tssrc/cli/lib/agent-management-listing.test.tssrc/cli/lib/agent-management-listing.tssrc/cli/lib/broker-lifecycle.tssrc/cli/lib/doctor.tssrc/cli/lib/project-broker-client.test.tssrc/cli/lib/project-broker-client.ts
| }, | ||
| "commits": [], | ||
| "filesChanged": [], | ||
| "projectId": "/Users/khaliqgant/Projects/AgentWorkforce/relay-worktrees/broker-session-query-and-listagents", |
There was a problem hiding this comment.
Avoid committing user-specific absolute paths in trajectory metadata.
projectId embeds a local username/path, which can leak user identifiers and machine layout in repository history. Prefer repo-relative or sanitized identifiers.
Suggested doc/data fix
- "projectId": "/Users/khaliqgant/Projects/AgentWorkforce/relay-worktrees/broker-session-query-and-listagents",
+ "projectId": "relay-worktrees/broker-session-query-and-listagents",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "projectId": "/Users/khaliqgant/Projects/AgentWorkforce/relay-worktrees/broker-session-query-and-listagents", | |
| "projectId": "relay-worktrees/broker-session-query-and-listagents", |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.trajectories/completed/2026-05/traj_2gpglosdsq7s.json at line 47, The
trajectory metadata contains a user-specific absolute path in the projectId
field which can leak usernames; update the projectId entry (the "projectId" key
in the JSON) to use a repo-relative or sanitized identifier (e.g., the
repository name or a canonical ID) instead of the full local path, or derive it
from an environment/config value so local machine paths are never committed;
ensure any commit replaces occurrences of the absolute path with the sanitized
value and add a note or validation to prevent future commits of absolute user
paths.
| Run from a temporary project with the built CLI: | ||
|
|
||
| ```bash | ||
| CLI=/Users/khaliqgant/Projects/AgentWorkforce/relay/dist/src/cli/index.js |
There was a problem hiding this comment.
Use a portable CLI path in repro steps.
Hardcoded /Users/.../dist/... paths make this unusable for other contributors and CI doc validation.
Suggested doc fix
-CLI=/Users/khaliqgant/Projects/AgentWorkforce/relay/dist/src/cli/index.js
+CLI="$(pwd)/dist/src/cli/index.js"Also applies to: 110-110
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/doctor-orchestration-repros.md` at line 12, Replace the hardcoded
absolute CLI path
"CLI=/Users/khaliqgant/Projects/AgentWorkforce/relay/dist/src/cli/index.js" (and
the duplicate at line 110) with a portable reference; for example use a relative
path like "CLI=./dist/src/cli/index.js", a workspace variable like
"CLI=$PWD/dist/src/cli/index.js", or an npm script invocation (e.g., "npm run
cli") so the repro steps work for other contributors and CI.
| node "$CLI" who --json | ||
| kill -9 11078 | ||
| node "$CLI" up --no-dashboard --port 49300 | ||
| node -e 'const fs=require("fs"); const old=JSON.parse(process.argv[1]); const next=JSON.parse(process.argv[2]); old.pid=next.pid; fs.writeFileSync(".agent-relay/connection.json", JSON.stringify(old,null,2));' "$CONN1" "$CONN2" |
There was a problem hiding this comment.
Undefined CONN1/CONN2 makes the repro non-runnable.
This command references $CONN1 and $CONN2, but they are never set in the snippet, so the “deterministic” repro will fail as written.
Suggested doc fix
node "$CLI" up --no-dashboard --port 49200
-cat .agent-relay/connection.json
+CONN1="$(cat .agent-relay/connection.json)"
...
node "$CLI" up --no-dashboard --port 49300
+CONN2="$(cat .agent-relay/connection.json)"
node -e 'const fs=require("fs"); const old=JSON.parse(process.argv[1]); const next=JSON.parse(process.argv[2]); old.pid=next.pid; fs.writeFileSync(".agent-relay/connection.json", JSON.stringify(old,null,2));' "$CONN1" "$CONN2"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/doctor-orchestration-repros.md` at line 21, The repro command references
undefined shell variables $CONN1 and $CONN2, making the script non-runnable; fix
by either (a) adding explicit variable definitions (e.g., export
CONN1="path/to/conn1.json" and export CONN2="path/to/conn2.json") before the
node -e invocation, or (b) modify the command to accept positional parameters
(replace "$CONN1" "$CONN2" with "$1" "$2" and document how to call it), and
update the docs to show an example pair of concrete file paths so node -e '...
"$CONN1" "$CONN2"' can run deterministically.
Summary
Three core broker-reliability fixes from real orchestration friction, consolidated and reviewed together. Stacked on PR #914 (
combined/reliability-and-logs) because thedoctorregression depends on #914'sdoctorwork — the review delta here is only the 3 fixes. Merge #914 first; GitHub will retarget this base tomain.Fixes
fa4617c9):agent-relay updetects a half-started/orphaned broker (PID alive but no readable connection metadata / readiness timed out) and recovers — reap+restart or exit non-zero with actionable guidance — instead of silently leaving a live-but-useless broker.agent-relay down --forcealso reapsagent-relay up --foregroundwrappers.5b27e066):who/agentsno longer mask broker/session query failures as an empty list (removed silent.catch(() => [])) — they now print a clear error and exit non-zero.history/repliesresolve the project broker session even whenAGENT_RELAY_STATE_DIRpoints elsewhere (fixes "Failed to query broker session: typo in url or port?" on a healthy broker).d186133c):agent-relay doctorflags stale-connection, unresolved${RELAY_*}API-key template, and half-started/orphaned broker as distinct actionable failures instead of "healthy"; deterministic repros documented indocs/doctor-orchestration-repros.md.These resolve three friction reports where the orchestration read surface was unusable on an apparently-healthy broker.
Review & validation
Independently reviewed (Claude): VERDICT GO — confirmed each fix genuine, the hand-resolved
agent-management-listing.tsmerge combines #914's enrichedwhofields and the new error-surfacing with zero loss, no regression of #914/#892–#895/agents:logs. Validated on the merged branch: typecheck clean, 237 CLI + 631 broker + 54 SDK tests, build green.Non-blocking follow-ups noted by review: commit
d186133cis label-only-misleading (test(doctor):but includes prod doctor logic);isForegroundBrokerCliCommandregex is conservatively AND-gated (negligible risk).🤖 Generated with Claude Code