feat: centralize CLI models + refresh README for open source launch#435
Conversation
- Add --local option to `up` command for future local Relaycast binary - Dashboard auto-creates its own RelayAdapter when projectRoot is provided - Simplify startDashboard call (no relayAdapter param needed from CLI) - Stub logs fallback message until co-founder ships Relaycast binary Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts from main's version bumps (2.3.14) and new features (agent heartbeat, exit detection, workflows) while preserving the PR's package consolidation and broker SDK refactor. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| if (!apiKey) { | ||
| const raw = await readFile(cachePath, "utf-8"); | ||
| const creds = JSON.parse(raw); | ||
| apiKey = creds.api_key; | ||
| } |
There was a problem hiding this comment.
🔴 createRelaycastClient throws unhandled ENOENT when cache file doesn't exist
When no API key is provided via options or RELAY_API_KEY env var, createRelaycastClient calls readFile(cachePath, "utf-8") without a try/catch. If the cache file ~/.agent-relay/relaycast.json doesn't exist (common for first-time users or CI environments), this throws an unhandled ENOENT error instead of the intended "API key not found" error message on line 40.
Root Cause and Impact
At packages/broker-sdk/src/relaycast.ts:34, the readFile call is not wrapped in a try/catch. The code assumes the file exists and is valid JSON. If the file is missing, the function throws Error: ENOENT: no such file or directory instead of falling through to the descriptive error on line 40: "Relaycast API key not found in options, RELAY_API_KEY, or cache file".
This affects any caller of createRelaycastClient without an API key, including the MCP adapter's getInbox() and getHealth() methods at packages/mcp/src/client-adapter.ts:638 and packages/mcp/src/client-adapter.ts:680.
Actual behavior: Throws ENOENT: no such file or directory, open '/home/user/.agent-relay/relaycast.json'
Expected behavior: Throws "Relaycast API key not found in options, RELAY_API_KEY, or cache file"
| if (!apiKey) { | |
| const raw = await readFile(cachePath, "utf-8"); | |
| const creds = JSON.parse(raw); | |
| apiKey = creds.api_key; | |
| } | |
| if (!apiKey) { | |
| try { | |
| const raw = await readFile(cachePath, "utf-8"); | |
| const creds = JSON.parse(raw); | |
| apiKey = creds.api_key; | |
| } catch { | |
| // Cache file missing or invalid — fall through to the error below | |
| } | |
| } |
Was this helpful? React with 👍 or 👎 to provide feedback.
| @@ -194,9 +204,36 @@ export class AgentRelayClient { | |||
| return result; | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
🟡 spawnPty sends hardcoded runtime "pty" but spawnAgent sends "headless_claude" — inconsistent runtime values
The spawnPty method at packages/broker-sdk/src/client.ts:172 hardcodes runtime: "pty" in the AgentSpec, while the spawnAgent method at line 196 hardcodes runtime: "headless_claude". However, the Rust binary's AgentSpec struct uses an AgentRuntime enum. The spawnPty method always sends "pty" regardless of the CLI being spawned.
Detailed Explanation
At packages/broker-sdk/src/client.ts:168-187, the spawnPty method constructs an AgentSpec with runtime: "pty". This is correct for the general case. However, the new fields model, cwd, team, shadow_of, and shadow_mode are passed through from SpawnPtyInput to AgentSpec correctly.
The concern is that the spawnAgent method at line 190-204 hardcodes runtime: "headless_claude" — this means any agent spawned via spawnAgent is always treated as headless Claude, even if a different CLI is specified. This was pre-existing behavior but is now more visible since the MCP adapter's toAgentInfo function at packages/mcp/src/client-adapter.ts:281 maps headless_claude to cli: 'claude' and everything else to cli: 'pty', which means non-Claude agents spawned via spawnAgent would be incorrectly labeled.
This is a pre-existing issue that becomes more impactful with the migration since more code paths now use these methods.
Was this helpful? React with 👍 or 👎 to provide feedback.
| async waitForAgentReady(name: string, timeoutMs = 60_000): Promise<Agent> { | ||
| const client = await this.ensureStarted(); | ||
| const existing = this.knownAgents.get(name); | ||
| if (existing && this.readyAgents.has(name)) { | ||
| return existing; | ||
| } | ||
|
|
||
| return new Promise<Agent>((resolve, reject) => { | ||
| let settled = false; | ||
| let timeout: ReturnType<typeof setTimeout> | undefined; | ||
|
|
||
| const cleanup = () => { | ||
| unsubscribe(); | ||
| if (timeout) { | ||
| clearTimeout(timeout); | ||
| timeout = undefined; | ||
| } | ||
| }; | ||
|
|
||
| const resolveWith = (agent: Agent) => { | ||
| if (settled) return; | ||
| settled = true; | ||
| cleanup(); | ||
| resolve(agent); | ||
| }; | ||
|
|
||
| const rejectWith = (error: Error) => { | ||
| if (settled) return; | ||
| settled = true; | ||
| cleanup(); | ||
| reject(error); | ||
| }; | ||
|
|
||
| const unsubscribe = client.onEvent((event) => { | ||
| if (event.kind !== "worker_ready" || event.name !== name) { | ||
| return; | ||
| } | ||
| let agent = this.knownAgents.get(event.name); | ||
| if (!agent) { | ||
| agent = this.makeAgent(event.name, event.runtime, []); | ||
| this.knownAgents.set(event.name, agent); | ||
| } | ||
| this.readyAgents.add(event.name); | ||
| resolveWith(agent); | ||
| }); | ||
|
|
||
| timeout = setTimeout(() => { | ||
| rejectWith(new Error(`Timed out waiting for worker_ready for '${name}' after ${timeoutMs}ms`)); | ||
| }, timeoutMs); | ||
|
|
||
| const known = this.knownAgents.get(name); | ||
| if (known && this.readyAgents.has(name)) { | ||
| resolveWith(known); | ||
| } | ||
| }); | ||
| } |
There was a problem hiding this comment.
🟡 waitForAgentReady has a TOCTOU race between initial check and event subscription
In waitForAgentReady, the initial early-return check at lines 337-339 runs before the Promise constructor. Inside the Promise, the event listener is subscribed at line 368, and then a second check is performed at lines 385-388. However, a worker_ready event can arrive between the ensureStarted() call (line 336) and the onEvent subscription (line 368), causing the method to miss the event and time out.
Detailed Explanation of the Race Window
The sequence is:
ensureStarted()— starts the client, which begins receiving events- Check
knownAgents+readyAgents(lines 337-339) — agent not ready yet - Enter Promise constructor
- RACE WINDOW: Between steps 2 and 5, the event loop can process a
worker_readyevent that updatesreadyAgentsvia the main event handler atrelay.ts:506-514 - Subscribe to
onEvent(line 368) — too late, event already processed - Re-check
readyAgents(lines 385-388) — this catches the race!
Actually, looking more carefully, the re-check at lines 385-388 does catch this race condition because it runs synchronously within the same Promise constructor after subscribing. Since the worker_ready event handler at line 506-514 updates readyAgents synchronously, and the re-check at 385-388 reads it synchronously, this is safe.
However, there's still a subtle issue: the check at lines 337-339 and the check at lines 385-388 are identical duplicates. The first check (337-339) is redundant because the second check (385-388) handles the same case. This is not a bug per se, but the duplicate check is misleading — it suggests the author was trying to handle a race condition but the actual protection comes from the second check only.
Was this helpful? React with 👍 or 👎 to provide feedback.
CLI Refactoring: - Split 5361-line src/cli/index.ts into 8 modular command files + 14 shared lib files - Wire bootstrap.ts as new CLI entry point - Remove legacy patterns: tmux, socket/daemon, agents.json, storage adapter - Remove create-agent command (replaced by spawn) - Migrate all spawnPty calls to broker-sdk AgentRelayClient - Rename Rust broker binary from agent-relay to agent-relay-broker - 66 tests with full DI pattern across all modules Broker-SDK Improvements: - Per-agent cwd on spawnPty - Agent.waitForReady() promise - Structured data field in sendMessage - Reason parameter on release() - Model as first-class spawn option - Agent.status computed getter (spawning/ready/idle/exited) - Agent.onOutput() per-agent streaming hook - relay.system() convenience handle SDK Rename: - Rename packages/broker-sdk to packages/sdk (@agent-relay/sdk) - Delete empty packages/sdk-ts - Update all imports, build scripts, CI workflows Workflow Runner Fixes: - Add trajectories and idleThresholdSecs to schema.json - Fix README template table patterns - Extract shared execute/resume logic into runWorkflowCore - Remove global process.env.RELAY_API_KEY mutation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts: # packages/broker-sdk/src/logs.ts # packages/broker-sdk/tsconfig.json # packages/sdk/src/__tests__/idle-nudge.test.ts # packages/sdk/src/__tests__/unit.test.ts # packages/sdk/src/examples/workflow-superiority.ts # packages/sdk/src/relay.ts # packages/sdk/src/workflows/builtin-templates/competitive.yaml # packages/sdk/src/workflows/runner.ts
Tests for gc command and tmux discovery were referencing deleted functionality from the legacy cleanup wave. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Docs (22 files): - Replace create-agent with spawn across all docs - Rename agent-relay binary refs to agent-relay-broker - Update imports from 'agent-relay' to '@agent-relay/sdk' - Remove tmux references and prerequisites - Update socket/legacy references to file-based protocol CLI: - Remove obsolete doctor command (SQLite/JSONL diagnostics no longer needed) - Clean up doctor from monitoring module, bootstrap, and all tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…o cli-uses-broker
| } else { | ||
| machineId = `${deps.getHostname()}-${deps.randomHex(8)}`; | ||
| fs.mkdirSync(dataDir, { recursive: true }); | ||
| fs.writeFileSync(machineIdPath, machineId); |
Check failure
Code scanning / CodeQL
Potential file system race condition High
| const modelSplit = architectCli.split(':'); | ||
| const cli = modelSplit[0]; | ||
| const model = modelSplit[1]; | ||
| const outboxPath = deps.getAgentOutboxTemplate().replace(/\$/g, '\\$'); |
Check failure
Code scanning / CodeQL
Incomplete string escaping or encoding High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 3 months ago
In general, the problem arises because the code tries to escape one special character ($) in a string coming from deps.getAgentOutboxTemplate() without considering backslashes or the real context in which the string needs escaping. The correct fix is to (a) avoid ad‑hoc partial escaping and (b) either use a well-tested escaping helper for the actual target context (regex, replacement string, shell, etc.) or, if the string is only used literally, not escape it at all.
Within this snippet, outboxPath is only used inside a prompt string shown to an agent: `Use relay protocol with outbox path ${outboxPath}. ...`. There is no interpretation of $ or \ in such plain text, so escaping $ is unnecessary and misleading. The minimal and safest behavior-preserving fix is to stop doing incomplete escaping and just use the template as-is. That means replacing:
const outboxPath = deps.getAgentOutboxTemplate().replace(/\$/g, '\\$');with:
const outboxPath = deps.getAgentOutboxTemplate();This removes the broken custom escaping without changing any visible semantics, because the prompt will now show the actual template string instead of an odd $→\$-mangled form. No new imports or helper functions are needed; we just simplify the assignment on line 91.
| @@ -88,7 +88,7 @@ | ||
| const modelSplit = architectCli.split(':'); | ||
| const cli = modelSplit[0]; | ||
| const model = modelSplit[1]; | ||
| const outboxPath = deps.getAgentOutboxTemplate().replace(/\$/g, '\\$'); | ||
| const outboxPath = deps.getAgentOutboxTemplate(); | ||
| const prompt = | ||
| 'You are the Architect, a cross-project coordinator.\n\n' + | ||
| `Connected Projects:\n${formatProjectContext(valid)}\n\n` + |
| deps.log(` - ${credential.provider}`); | ||
| } | ||
|
|
||
| fs.writeFileSync(credentialsPath, JSON.stringify(credentials, null, 2)); |
Check warning
Code scanning / CodeQL
Network data written to file Medium
| } | ||
|
|
||
| if (process.platform === 'win32') { | ||
| await execFileAsync('cmd', ['/c', 'start', '', url]); |
Check warning
Code scanning / CodeQL
Indirect uncontrolled command line Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 3 months ago
General fix: avoid invoking cmd with user-controlled arguments. Use an API that does not require shell metacharacter parsing, or change the invoked program to something that directly accepts a URL argument without indirect shell semantics. For opening URLs on Windows, a common pattern is to call explorer.exe with the URL; this runs without an intermediate cmd layer and still opens the user’s default browser.
Concrete best fix here, without changing external behavior:
- Keep using
execFileAsync(already a safe API). - Change the Windows branch of
defaultOpenExternalto executeexplorer.exedirectly instead ofcmd /c start "" url. - This preserves the “open in default browser” behavior while removing the dependency on
cmd’s argument parsing. - No changes are needed elsewhere;
deps.openExternal’s behavior is the same from the caller’s perspective.
Specific change:
-
In
src/cli/commands/cloud.ts, indefaultOpenExternal, replace:if (process.platform === 'win32') { await execFileAsync('cmd', ['/c', 'start', '', url]); return; }
with:
if (process.platform === 'win32') { await execFileAsync('explorer.exe', [url]); return; }
No new imports or helper functions are needed; we reuse the existing execFileAsync defined with promisify(execFile).
| @@ -67,7 +67,7 @@ | ||
| } | ||
|
|
||
| if (process.platform === 'win32') { | ||
| await execFileAsync('cmd', ['/c', 'start', '', url]); | ||
| await execFileAsync('explorer.exe', [url]); | ||
| return; | ||
| } | ||
|
|
Delete all 20 MCP integration test files (now obsolete after @agent-relay/mcp removal). Update test runner to SDK-only, simplify package.json test scripts. Fix e2e permission denied by chmod+x index.js in postbuild. Fix stress test crash from missing packages/resiliency import. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add auto-restart supervisor, metrics collection, and crash insights modules to the Rust broker. Expose new types and methods via both TypeScript and Python SDKs. Remove references to deleted @agent-relay/storage package from CI workflows. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…o cli-uses-broker
spawn_wrap_with_token pre-registers the child agent in Relaycast before spawning the PTY, so the child MCP server starts with a valid RELAY_AGENT_TOKEN already set. Avoids "Not registered" errors for non-claude CLIs (codex, gemini) that don't call register() themselves before using relay tools. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- agent-relay workflows list: lists built-in YAML templates from packages/sdk/dist/workflows/builtin-templates/ - agent-relay swarm --dry-run: prints execution plan (pattern, task, teams, timeout, cli) without starting the broker or spawning agents Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- relay-workflow --resume <run-id>: resumes a failed or interrupted run - relay-workflow --validate: runs static analysis without executing - validator.ts: 5 checks — unknown agent ref, step chaining on interactive agent (TUI chrome warning), interactive codex missing /exit, claude without spawn guard on long tasks, non-interactive agent using relay tools Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…put chaining
New test files:
- workflow-noninteractive.test.ts: regression tests for stdin blocking fix,
env var propagation to subprocess, swarm timeoutMs enforcement
- cli-commands.test.ts: end-to-end CLI tests (version, swarm --dry-run,
workflows list, spawn/agents/release lifecycle)
- output-chaining.test.ts: {{steps.X.output}} interpolation end-to-end
Updated:
- workflow-runner.test.ts: assertRunFailed test for failed run transitions
- swarm-errors.test.ts, swarm-dag.test.ts: updated assertions
- events-relaycast.test.ts, trajectory.test.ts: lifecycle test updates
- broker-harness.ts, workflow-harness.ts: harness improvements
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Moves the large inline <style> block from public/index.html into web/static/css/tokens.css. The HTML now references it via <link>. Reduces HTML size and makes the styles independently editable. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ests Non-claude CLIs (codex, gemini, etc.) must call register() before using any relay MCP tools. WorkflowRunner now injects a RELAY SETUP preamble into every non-claude interactive agent task reminding it to register first. Pairs with the wrap.rs pre-registration change (child agents receive a RELAY_AGENT_TOKEN so the MCP server starts already authenticated). - workflow-relay-tools.test.ts: integration tests verifying agents can actually use Relaycast tools end-to-end - relay.workflow-hardening.yaml: 6-hour matrix workflow exercising all swarm patterns across all CLIs and model permutations Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Rebuilt from source after auth.rs, main.rs, helpers.rs, pty_worker.rs, spawner.rs, and wrap.rs changes in this branch. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replace mechanical event logging with reasoning-focused trajectory content:
Intent at start: records workflow description (the "why") and swarm
pattern rather than just step count.
Step intent: captures first sentence of task as the step's purpose,
not just "assigned to agent X".
Failure root-cause: classifies failures as timeout / verification_mismatch
/ spawn_failed / exit_nonzero / aborted and generates actionable diagnosis.
E.g. non-interactive timeout → "pre-read files in a deterministic step";
verification mismatch → "task prompt may not clearly specify output format".
Completion signal: records the last meaningful output line (the sentinel)
rather than a raw 200-char preview.
Retrospective: buildRunSummary produces a failure narrative ("failed at
'analyze' [timeout], caused 3 downstream steps to be skipped") instead of
statistics. extractLearnings generates actionable recommendations per
failure type. extractChallenges includes root-cause diagnosis, not just
the error message. approach field reflects actual pattern + agent count.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…st logs
trajectory-quality.test.ts (3 integration tests):
1. Success path: asserts trajectory has Purpose (from description), step
intent (first sentence of task), completion sentinel in finding event,
narrative summary, and approach from actual swarm pattern
2. Verification mismatch: asserts failure classified as
verification_mismatch, sentinel named in diagnosis, actionable
learning about output format, failure narrative (not stat string)
3. Raw field: asserts raw.cause is machine-readable string so dashboards
and future agents can consume it without parsing prose
test-trajectory-quality.yaml:
Runnable workflow counterpart — 'work' succeeds, 'fail-known' deliberately
fails with a verification mismatch, 'check-trajectory' deterministic step
reads the trajectory and asserts properties inline via Node.js. Uses
onError: continue so check-trajectory runs even when fail-known fails.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…file interference
… written on completion)
…erall failure; TS tests cover verification_mismatch
relay.bug-fixes.yaml: - maxConcurrency: 5 → 3 (avoid broker saturation) - red-team: removed interactive: false (complex multi-tool review task) - maxIterations → retries throughout (maxIterations is not a valid field) - red-team verification: RED_TEAM_CLEAN|RED_TEAM_ISSUES → RED_TEAM_ prefix (output_contains is literal, not regex) - Added "Do NOT use relay_spawn or add_agent" to all 6 agent tasks - Removed dead coordination.barriers, state, errorHandling blocks runner.ts: - Log observer URL after workspace created: Observer: https://observer.relaycast.dev/?key=<RELAY_API_KEY> Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ighter prompts - All agent tasks now say "print directly to your terminal" and "do NOT relay_send" to prevent sentinels being sent as channel messages (which verification cannot see) - write-*-reproductions: explicit "do NOT re-investigate, use injected findings only" to prevent agents looping through source files again - maxIterations → retries on remaining steps - Added Wave 10: summary step — red-team agent synthesises all step outputs into a human-readable recap at the end of the run - Note: resume not supported (InMemoryWorkflowDb; start fresh) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tion + dashboard tests already done)
- Non-interactive agent presets (lead/worker/reviewer/analyst) with table
- Pre-inject pattern: deterministic step reads file, injects via {{steps.X.output}}
- DAG deadlock anti-pattern: workers must not depend on the lead step
- Updated common mistakes table with new failure modes discovered in testing
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…o cli-uses-broker
- RelayErrorCode: 'agent_already_exists'/'channel_already_exists' → 'name_conflict' - Message fields: agent_name/created_at → agentName/createdAt - Agent fields: last_seen → lastSeen - Channel fields: member_count/created_at/is_archived → memberCount/createdAt/isArchived - getMessages() return type updated to camelCase - Local interfaces (RelaycastAgent/Channel/Message) updated to camelCase Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Summary
This PR prepares the relay repo for open source launch with:
Centralized CLI Registry
packages/shared/cli-registry.yaml@agent-relay/configModels.Claude.OPUS,Models.Codex.CODEX_5_3, etc.)README Refresh
Cleanup
scripts/migrate-to-broker-sdk.ts(migration complete)TRAIL_GIT_AUTH_FIX.mdto.trajectories/completed/.worktrees/to.gitignoreTest plan
npm run codegento confirm generation works🤖 Generated with Claude Code