feat: add Rust broker and sdk-ts (additive, no removals)#424
feat: add Rust broker and sdk-ts (additive, no removals)#424willwashburn merged 45 commits intomainfrom
Conversation
Add relay-broker Rust implementation and TypeScript SDK alongside existing codebase. This is an additive-only change for incremental migration - no existing packages or code removed. Added: - Cargo.toml/Cargo.lock for Rust project - src/*.rs - Rust broker implementation (21 files) - packages/sdk-ts - TypeScript SDK for broker - .github/workflows/rust-ci.yml - Rust CI workflow - .claude/rules/rust.md - Rust coding guidelines Existing TypeScript packages remain unchanged and functional. Integration will happen gradually over time. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
|
||
| function runQualityChecks(): { passed: boolean; output: string } { | ||
| try { | ||
| const output = execSync(QUALITY_CMD, { encoding: "utf-8", stdio: "pipe" }); |
Check warning
Code scanning / CodeQL
Indirect uncontrolled command line Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 2 months ago
In general, to fix this kind of issue you want to avoid passing untrusted data as a single shell command string. Prefer APIs that either (a) do not use a shell or (b) accept an argument vector ([cmd, arg1, arg2, ...]) instead of a concatenated string, and/or restrict the allowed commands to a safe set.
In this file, QUALITY_CMD is used only to run quality checks. The safest, least-invasive fix is:
- Parse
QUALITY_CMDinto a program name and arguments without invoking a shell. - Use
child_process.execFileSync(orspawnSync) with that parsed array, which does not perform shell expansion. - Keep the default
QUALITY_CMD = "npm run check"working, and also support simple overridable commands like"pnpm lint".
To achieve this while avoiding writing our own shell parser, we can use the well-known shell-quote library (as suggested in the background). shell-quote.parse will convert a string like "npm run check" into ["npm", "run", "check"] while honoring quotes. Then we can call:
const parts = shellQuote.parse(QUALITY_CMD).map(String);
const [cmd, ...args] = parts;
execFileSync(cmd, args, { encoding: "utf-8", stdio: "pipe" });This removes the shell from the picture but preserves functional behavior for reasonable QUALITY_CMD values. If QUALITY_CMD is empty or malformed we can fall back to a safe default (npm, ["run", "check"]).
Concretely:
- Add an import for
shell-quoteat the top ofpackages/sdk-ts/src/examples/ralph-loop.ts. - Replace the
execSync(QUALITY_CMD, ...)call inrunQualityCheckswith:- parsing
QUALITY_CMDusingshellQuote.parse, - deriving
cmdandargs, - using
execFileSyncinstead ofexecSyncwith those arguments.
- parsing
- Keep error handling the same (capture
stdoutfrom the error when possible).
This change removes the uncontrolled shell command issue without altering the semantics for normal use and respects the existing configuration mechanism.
| @@ -26,7 +26,8 @@ | ||
| * https://ghuntley.com/ralph/ | ||
| */ | ||
| import fs from "node:fs"; | ||
| import { execSync } from "node:child_process"; | ||
| import { execFileSync } from "node:child_process"; | ||
| import shellQuote from "shell-quote"; | ||
| import { AgentRelay, type Agent, type Message } from "../relay.js"; | ||
|
|
||
| // ── Types ─────────────────────────────────────────────────────────────────── | ||
| @@ -81,7 +82,9 @@ | ||
|
|
||
| function runQualityChecks(): { passed: boolean; output: string } { | ||
| try { | ||
| const output = execSync(QUALITY_CMD, { encoding: "utf-8", stdio: "pipe" }); | ||
| const parsed = shellQuote.parse(QUALITY_CMD).map(String); | ||
| const [cmd, ...args] = parsed.length > 0 ? parsed : ["npm", "run", "check"]; | ||
| const output = execFileSync(cmd, args, { encoding: "utf-8", stdio: "pipe" }); | ||
| return { passed: true, output }; | ||
| } catch (err: unknown) { | ||
| const output = (err as { stdout?: string }).stdout ?? String(err); |
| @@ -28,6 +28,7 @@ | ||
| "typescript": "^5.7.3" | ||
| }, | ||
| "dependencies": { | ||
| "@relaycast/sdk": "^0.2.1" | ||
| "@relaycast/sdk": "^0.2.1", | ||
| "shell-quote": "^1.8.3" | ||
| } | ||
| } |
| Package | Version | Security advisories |
| shell-quote (npm) | 1.8.3 | None |
Remove `cargo test --test 'stress_*'` since there are no stress tests in the codebase yet. This was causing CI failures. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The sdk-ts package doesn't have a package-lock.json since it's newly added and not yet integrated into root workspace. Use npm install instead of npm ci, and set working-directory. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The existing packages/sdk already has the name @agent-relay/sdk. Renamed new package to @agent-relay/broker-sdk to avoid conflict. Also updated package-lock.json to include the new workspace. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The build script was trying to run `cargo build` but Rust isn't installed in the standard Node.js CI jobs. Split the build: - `build` - TypeScript compilation only (no Rust required) - `build:full` - TypeScript + Rust binary bundling Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Devin Review found 1 new potential issue.
🐛 1 issue in files not directly in the diff
🐛 Byte-index slicing of mapped.text panics on multi-byte UTF-8 characters (src/main.rs:1462)
The expression &mapped.text[..120] at src/main.rs:1462 uses byte-based indexing to truncate the relay message text for display. If the 120th byte falls in the middle of a multi-byte UTF-8 character (e.g., emoji, CJK, accented characters), this will panic at runtime with byte index 120 is not a char boundary.
Root Cause and Impact
The code checks mapped.text.len() > 120 (which is a byte length check) and then slices at byte position 120 with &mapped.text[..120]. For ASCII-only text this works fine, but any multi-byte character spanning the boundary will cause a panic.
The codebase already has a floor_char_boundary helper function (defined at src/main.rs:2910) that safely finds the nearest valid character boundary. The fix is to use it:
let boundary = floor_char_boundary(&mapped.text, 120);
format!("{}…", &mapped.text[..boundary])Impact: The broker process crashes when a relay message containing multi-byte characters is received in listen mode, killing all managed agent sessions.
View 13 additional findings in Devin Review.
The broker-sdk tests use Node.js built-in test runner, not vitest. Exclude packages/sdk-ts from vitest config to prevent test failures. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Devin Review found 1 new potential issue.
🐛 1 issue in files not directly in the diff
🐛 Byte-index slicing of mapped.text panics on multi-byte UTF-8 characters (src/main.rs:1462)
The expression &mapped.text[..120] at src/main.rs:1462 uses byte-based indexing to truncate the relay message text for display. If the 120th byte falls in the middle of a multi-byte UTF-8 character (e.g., emoji, CJK, accented characters), this will panic at runtime with byte index 120 is not a char boundary.
Root Cause and Impact
The code checks mapped.text.len() > 120 (which is a byte length check) and then slices at byte position 120 with &mapped.text[..120]. For ASCII-only text this works fine, but any multi-byte character spanning the boundary will cause a panic.
The codebase already has a floor_char_boundary helper function (defined at src/main.rs:2910) that safely finds the nearest valid character boundary. The fix is to use it:
let boundary = floor_char_boundary(&mapped.text, 120);
format!("{}…", &mapped.text[..boundary])Impact: The broker process crashes when a relay message containing multi-byte characters is received in listen mode, killing all managed agent sessions.
View 13 additional findings in Devin Review.
Applied rustfmt formatting: - telemetry.rs: Break long method chains across lines - main.rs: Format function signature across multiple lines Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Devin Review found 1 new potential issue.
🐛 1 issue in files not directly in the diff
🐛 Byte-index slicing of mapped.text panics on multi-byte UTF-8 characters (src/main.rs:1462)
The expression &mapped.text[..120] at src/main.rs:1462 uses byte-based indexing to truncate the relay message text for display. If the 120th byte falls in the middle of a multi-byte UTF-8 character (e.g., emoji, CJK, accented characters), this will panic at runtime with byte index 120 is not a char boundary.
Root Cause and Impact
The code checks mapped.text.len() > 120 (which is a byte length check) and then slices at byte position 120 with &mapped.text[..120]. For ASCII-only text this works fine, but any multi-byte character spanning the boundary will cause a panic.
The codebase already has a floor_char_boundary helper function (defined at src/main.rs:2910) that safely finds the nearest valid character boundary. The fix is to use it:
let boundary = floor_char_boundary(&mapped.text, 120);
format!("{}…", &mapped.text[..boundary])Impact: The broker process crashes when a relay message containing multi-byte characters is received in listen mode, killing all managed agent sessions.
View 13 additional findings in Devin Review.
Adds a workflow that automatically runs `cargo fmt` on PRs that touch Rust files. If formatting changes are needed, it commits them directly to the PR branch. This ensures Rust code is always properly formatted without manual intervention. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Devin Review found 1 new potential issue.
🐛 1 issue in files not directly in the diff
🐛 Byte-index slicing of mapped.text panics on multi-byte UTF-8 characters (src/main.rs:1462)
The expression &mapped.text[..120] at src/main.rs:1462 uses byte-based indexing to truncate the relay message text for display. If the 120th byte falls in the middle of a multi-byte UTF-8 character (e.g., emoji, CJK, accented characters), this will panic at runtime with byte index 120 is not a char boundary.
Root Cause and Impact
The code checks mapped.text.len() > 120 (which is a byte length check) and then slices at byte position 120 with &mapped.text[..120]. For ASCII-only text this works fine, but any multi-byte character spanning the boundary will cause a panic.
The codebase already has a floor_char_boundary helper function (defined at src/main.rs:2910) that safely finds the nearest valid character boundary. The fix is to use it:
let boundary = floor_char_boundary(&mapped.text, 120);
format!("{}…", &mapped.text[..boundary])Impact: The broker process crashes when a relay message containing multi-byte characters is received in listen mode, killing all managed agent sessions.
View 15 additional findings in Devin Review.
Fixed potential panic when truncating relay message text for display. The code was using byte-index slicing which can panic if the index falls in the middle of a multi-byte UTF-8 character. Now uses floor_char_boundary helper to find safe character boundary. Reported by: Devin Review Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fixes from DeepThinker-2 code review: HIGH priority: - Added floor_char_boundary helper to conversation_log.rs - Fixed 3 UTF-8 panic bugs in truncate_text, pad_or_truncate, short_id - Same class of bug as main.rs fix LOW priority: - Updated README.md to reference correct package name @agent-relay/broker-sdk Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds a Rust-based broker implementation and TypeScript SDK to the existing Agent Relay codebase. The changes are additive-only with no removals, enabling incremental migration from the existing TypeScript implementation to Rust. The PR includes 21 Rust source files implementing core broker functionality (authentication, WebSocket handling, process management, telemetry), a TypeScript SDK package with protocol client and high-level API facade, CI workflows for Rust testing and formatting, and Rust coding conventions documentation.
Changes:
- Adds complete Rust broker implementation with authentication, WebSocket client, process spawning, message scheduling, and telemetry
- Adds TypeScript SDK (
@agent-relay/broker-sdk) with protocol client, high-level API, and integration tests - Adds GitHub Actions workflows for Rust CI (test, clippy, fmt) and auto-formatting
Reviewed changes
Copilot reviewed 42 out of 45 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| Cargo.toml | Rust project configuration with dependencies |
| src/*.rs (21 files) | Core broker modules: auth, config, spawner, telemetry, WebSocket, etc. |
| packages/sdk-ts/src/*.ts | TypeScript SDK with client, protocol definitions, and high-level API |
| packages/sdk-ts/package.json | SDK package configuration |
| .github/workflows/rust-ci.yml | CI workflow for Rust tests and linting |
| .github/workflows/rust-fmt-fix.yml | Auto-format workflow for Rust code |
| .claude/rules/rust.md | Rust coding conventions and guidelines |
| vitest.config.ts | Excludes sdk-ts from vitest (uses Node.js test runner) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (!currentPath.split(path.delimiter).includes(binDir)) { | ||
| env.PATH = `${binDir}${path.delimiter}${currentPath}`; |
There was a problem hiding this comment.
The PATH manipulation assumes relay_send is in the same directory as the agent-relay binary, but the comment mentions "relay_send" which is not present in the added files. If relay_send is a separate binary that hasn't been added yet, this could cause runtime failures when workers try to find it.
| if (!currentPath.split(path.delimiter).includes(binDir)) { | |
| env.PATH = `${binDir}${path.delimiter}${currentPath}`; | |
| const relaySendPath = path.join(binDir, "relay_send"); | |
| if (fs.existsSync(relaySendPath) && !currentPath.split(path.delimiter).includes(binDir)) { | |
| env.PATH = currentPath ? `${binDir}${path.delimiter}${currentPath}` : binDir; |
| use sha2::{Digest, Sha256}; | ||
| use tokio::sync::mpsc; | ||
|
|
||
| const POSTHOG_API_KEY: &str = "phc_2uDu01GtnLABJpVkWw4ri1OgScLU90aEmXmDjufGdqr"; |
There was a problem hiding this comment.
The PostHog API key is hardcoded as a public constant. While this is a write-only key for telemetry ingestion, consider documenting that this is intentional and expected to be public. If this key is exposed and abused, it could result in data pollution or quota exhaustion.
| cmd.arg("wrap").arg(cli); | ||
| for arg in extra_args { | ||
| cmd.arg(arg); | ||
| } |
There was a problem hiding this comment.
The cli and extra_args parameters are passed directly to Command without validation. This could allow command injection if these values come from untrusted input. Consider validating or sanitizing these parameters, especially if they originate from network requests or user input.
| exclude: [ | ||
| '**/node_modules/**', | ||
| '**/dist/**', | ||
| 'packages/sdk-ts/**', // Uses Node.js test runner, not vitest | ||
| ], |
There was a problem hiding this comment.
The packages/sdk-ts/** exclusion is correct for the stated reason (Node.js test runner), but consider verifying that no other TypeScript tests in the repo are being inadvertently excluded. The pattern uses a glob that matches any depth within sdk-ts.
| #[cfg(not(unix))] | ||
| fn set_owner_only_permissions(_path: &Path) -> Result<()> { | ||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
On Windows (non-Unix), credential files are saved without permission restrictions. This could expose API keys and tokens to other users on the same system. Consider using Windows-specific APIs (e.g., file encryption or ACLs) to restrict access on Windows platforms.
| "demo": "node dist/examples/demo.js", | ||
| "ralph": "node dist/examples/ralph-loop.js", | ||
| "example": "node dist/examples/example.js", | ||
| "prepack": "npm run build" |
There was a problem hiding this comment.
The prepack script only runs npm run build, not npm run build:full. This means the binary won't be bundled when users install the package via npm. If the binary is intended to be included in the published package, change prepack to run build:full instead of build.
| "prepack": "npm run build" | |
| "prepack": "npm run build:full" |
…on script Comprehensive plan to migrate CLI from daemon-based TS stack to Rust broker: - Migration plan with 7-layer delivery guarantees exceeding old orchestrator - Wave-based execution plan (9 waves, 36 short-lived agents) - Executable orchestration script using broker-sdk for agent coordination - 14 beads (agent-relay-549 through 562) tracking all work items Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extract helpers.rs, pty_worker.rs, wrap.rs from main.rs (~1100 lines moved) (done by Worker-W0-Decompose via broker orchestration) - Add atomic state persistence using tempfile + persist pattern - Add flock guard to prevent concurrent broker instances - Update broker-migration script with binaryPath, comms instructions, exit detection - Add tempfile dep and nix "fs" feature to Cargo.toml All 135 tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…erence The 7 auth tests were failing because the real RELAY_API_KEY env var leaked into mock-server tests, causing 404s against httpmock servers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…harness Fix broken agent→orchestrator communication: remove hardcoded RELAY_AGENT_NAME from .mcp.json so each spawned agent registers with its own identity via env var. Add onWorkerOutput hook to SDK for PTY output scanning, fix waitForExit(0) bug. Rewrite orchestration completion detection with event-driven exits + idle timeout. Wave 1 deliverables: broker-harness.ts, assert-helpers.ts, lifecycle/messaging/ events test suites (6 files, ~1300 lines of integration test infrastructure). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The gate was failing with "cargo: command not found" because ~/.cargo/bin is not in PATH for subshells. Also gracefully skip integration phase check when run-phase.ts hasn't been created yet. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codex workers need full file paths and concrete code snippets to know what to modify. Previous runs produced no code because prompts said "Read the Lead's plan" without giving actionable file locations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| if workers.has_worker(&payload.to) { | ||
| queue_and_try_delivery_raw( | ||
| workers, | ||
| pending_deliveries, | ||
| &payload.to, | ||
| &event_id, | ||
| &from, | ||
| &payload.to, | ||
| &payload.text, | ||
| payload.thread_id, | ||
| priority, | ||
| delivery_retry_interval(), | ||
| ) | ||
| .await?; | ||
|
|
||
| send_ok( | ||
| out_tx, | ||
| frame.request_id, | ||
| json!({ | ||
| "delivered": true, | ||
| "to": payload.to, | ||
| "event_id": event_id, | ||
| }), | ||
| ) | ||
| .await?; | ||
| } else if let Some(http) = relaycast_http { | ||
| // Target is not a local worker — forward via Relaycast REST API | ||
| let to = payload.to.clone(); | ||
| let eid = event_id.clone(); | ||
| match http.send(&to, &payload.text).await { |
There was a problem hiding this comment.
🟡 Broker send_message only recognizes exact local worker names, unlike other delivery paths (case/@Prefix)
The broker’s send_message local-delivery branch uses workers.has_worker(&payload.to), which only matches an exact worker name key.
Elsewhere, inbound WS direct-target delivery supports case-insensitive matching and @name aliasing via worker_names_for_direct_target().
Root Cause / Impact
In send_message, a request like { to: "@Worker1" } or { to: "worker1" } will not be treated as a local target even if the worker exists, and will instead fall through to Relaycast forwarding (src/main.rs:1639-1664). This is inconsistent with WS-driven direct message delivery (src/main.rs:841-853 uses worker_names_for_direct_target).
Impact: local agent-to-agent messaging via the SDK becomes unexpectedly slower and/or fails if Relaycast credentials/connectivity are missing, despite the worker being local.
Prompt for agents
Make send_message resolve local targets using the same logic as WS inbound delivery.
In src/main.rs:1639-1664, replace the `workers.has_worker(&payload.to)` / single-worker delivery with:
- `let targets = workers.worker_names_for_direct_target(&payload.to, &from);`
- if targets not empty, deliver to each target
This will add support for case-insensitive matching and `@name` syntax, and aligns SDK send_message with existing WS delivery behavior.
Was this helpful? React with 👍 or 👎 to provide feedback.
Wave 4 completed. Verdict: nt�[15;9HREVIEW:PASS�[15;21Hor�[15;24HREVIEW:FAIL�[15;36Hto�[15;39Hstdout.�[18;3H�[2mtab to queue message�[18;62H100% context left�[39m�[49m�[0m�[?25h�[16;3H�[?2026l Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codex workers need full absolute paths and concrete code snippets to produce output. Updated Waves 5-8 worker, reviewer prompts with exact file locations, struct definitions, and line number hints. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add delivery_queued and delivery_injected events to pty_worker.rs. Fix wrap.rs send_frame errors by replacing with tracing (wrap mode runs as a standalone process without protocol channel access). Update SDK protocol types and add onDeliveryUpdate hook. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New src/spawner.rs with session leader spawning (setsid), child process tracking, graceful termination (SIGTERM→SIGKILL), reap_exited, and 5 unit tests including session leader verification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delivery persistence (Wave 6 gap): - Add save_pending_deliveries/load_pending_deliveries with atomic writes - Load pending deliveries on broker startup for crash recovery - Save pending deliveries periodically in reap tick - Clean up pending.json on graceful shutdown Tilde expansion fix: - Add expandTilde() to SDK client.ts - Expand ~ to home directory before fs.existsSync and spawn calls - Fixes "binary not found" for paths like ~/bin/agent-relay Also fix libc::setsid -> nix::libc::setsid in spawner.rs for CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benchmarks: latency, throughput, reliability, overhead, scale-out Parity tests: broadcast, multi-worker, orch-to-worker, continuity-handoff, stability-soak All parity tests passing, 164 Rust tests passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| /** | ||
| * Start the broker process, wait for hello_ack. | ||
| * Creates both a low-level client and a high-level facade. | ||
| */ | ||
| async start(): Promise<void> { | ||
| if (this.started) return; | ||
|
|
||
| const clientOpts: AgentRelayClientOptions = { | ||
| binaryPath: this.opts.binaryPath, | ||
| channels: this.opts.channels, | ||
| requestTimeoutMs: this.opts.requestTimeoutMs, | ||
| shutdownTimeoutMs: this.opts.shutdownTimeoutMs, | ||
| env: this.opts.env, | ||
| }; | ||
|
|
||
| // Start the low-level client (spawns broker process) | ||
| this.client = await AgentRelayClient.start(clientOpts); | ||
|
|
||
| // Wire event collection | ||
| this.unsubEvent = this.client.onEvent((event: BrokerEvent) => { | ||
| this.events.push(event); | ||
| for (const listener of this.eventListeners) { | ||
| listener(event); | ||
| } | ||
| }); | ||
|
|
||
| // Create a high-level facade sharing the same binary/options | ||
| this.relay = new AgentRelay({ | ||
| binaryPath: this.opts.binaryPath, | ||
| channels: this.opts.channels, | ||
| requestTimeoutMs: this.opts.requestTimeoutMs, | ||
| shutdownTimeoutMs: this.opts.shutdownTimeoutMs, | ||
| env: this.opts.env, | ||
| }); | ||
|
|
||
| this.started = true; | ||
| } |
There was a problem hiding this comment.
🟡 Integration BrokerHarness can inadvertently start a second broker process (double-lock / port conflicts)
The integration test harness constructs both a low-level AgentRelayClient (which spawns a broker process) and a high-level AgentRelay facade (which can also spawn its own broker process) but does not share a single underlying process between them.
BrokerHarness.start()always startsthis.client = await AgentRelayClient.start(...)(spawns a broker) (/home/ubuntu/repos/repo-0f9455b42bf349aca671d3e83bb52228/tests/integration/broker/utils/broker-harness.ts:75-107).- It then creates
this.relay = new AgentRelay({...})with the samebinaryPathetc. (/home/ubuntu/repos/repo-0f9455b42bf349aca671d3e83bb52228/tests/integration/broker/utils/broker-harness.ts:97-105). If any test later callsharness.relay.*,AgentRelaywill callensureStarted()and spawn a second broker process (packages/sdk-ts/src/relay.ts:267-279).
Actual vs Expected / Impact
Actual:
- Some tests may end up with two broker processes in the same working directory. The Rust broker uses a directory lock (
.agent-relay/broker.lock), so the second broker may fail to start or may cause flaky behavior depending on timing.
Expected:
- The harness should either:
- only use one API (client or facade) per harness instance, or
- allow the facade to reuse the already-started client/process (not currently supported by the facade API).
Impact:
- Flaky integration tests and potential spurious failures that look like broker bugs.
Prompt for agents
In /home/ubuntu/repos/repo-0f9455b42bf349aca671d3e83bb52228/tests/integration/broker/utils/broker-harness.ts, avoid creating both a standalone AgentRelayClient and AgentRelay facade unless you can guarantee only one starts a broker. The simplest fix: remove the AgentRelay facade from BrokerHarness entirely (and update tests to use only AgentRelayClient methods), or lazily construct relay only when needed and document that it must not be used in the same harness instance. Alternatively, enhance packages/sdk-ts/src/relay.ts to accept an existing AgentRelayClient instance (shared stdio) so no second process is spawned, then update BrokerHarness to pass it through.
Was this helpful? React with 👍 or 👎 to provide feedback.
- Delete 7 spec/proposal/migration planning docs (superseded by code) - Delete orphaned activity.rs and throttle.rs (never compiled, stress tests confirm simple ThrottleState is sufficient at 100% success rate) - Delete broken cold-start.ts (uses old relay_inbound event) - Remove pub mod spawner from lib.rs (only used by binary crate) - Remove dead delivery_lifecycle_event_series function - Remove unused request_id field from PendingActivity - Add stress test suite (burst, contention, steady-state, spawn/release) - Zero compiler warnings, 158 tests passing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ensureStarted now clears startPromise on failure, allowing retry - send_message response includes targets array matching SDK type expectation - Both local delivery and relaycast publish paths return targets Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add 8 missing BrokerEvent variants to protocol.ts (delivery_verified, delivery_failed, delivery_active, delivery_ack, worker_ready, worker_error, relaycast_published, relaycast_publish_failed) - Make reap_exited non-fatal: log warning and skip cycle instead of crashing entire broker on transient OS error - Resolve pending exitResolvers on shutdown to prevent promise leaks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix Clippy lint errors: - src/pty_worker.rs:254 - use is_some() instead of if let Some(_) - src/wrap.rs:490 - use is_some() instead of if let Some(_) This resolves the clippy::redundant-pattern-matching warnings. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
| let body2: Value = res2.json().await?; | ||
| body2 | ||
| .pointer("/data/token") | ||
| .or_else(|| body2.get("token")) | ||
| .and_then(Value::as_str) | ||
| .map(String::from) | ||
| .ok_or_else(|| anyhow::anyhow!("no token in register response"))? |
There was a problem hiding this comment.
🟡 Missing HTTP status check on 409-conflict retry in ensure_token hides real errors
When ensure_token in RelaycastHttpClient gets a 409 Conflict on the first agent registration, it retries with a suffixed name. However, the retry response res2 is never checked for HTTP error status.
Detailed Explanation
If the retry registration also fails (e.g., returns 401 Unauthorized or 500 Internal Server Error), the code at relaycast_ws.rs:291 attempts res2.json().await? on a potentially non-JSON error body. If the body is JSON but has no data.token field, the error message is the misleading "no token in register response" instead of the actual HTTP status code. If the body is not valid JSON (e.g., an HTML error page), the error is an opaque deserialization failure.
Compare this to the success path at line 298 (else if !status.is_success()) which correctly surfaces the HTTP status and response body.
Actual: Retry path on 409 conflict silently tries to parse any response body as a token, producing confusing error messages when the retry itself fails.
Expected: The retry should check res2.status().is_success() and bail with a clear error if it's not, similar to the primary registration path.
Impact: Users debugging Relaycast connectivity issues on the retry path would see misleading errors like "no token in register response" or JSON parse errors instead of the actual HTTP status, making diagnosis harder.
| let body2: Value = res2.json().await?; | |
| body2 | |
| .pointer("/data/token") | |
| .or_else(|| body2.get("token")) | |
| .and_then(Value::as_str) | |
| .map(String::from) | |
| .ok_or_else(|| anyhow::anyhow!("no token in register response"))? | |
| let body2: Value = res2.json().await?; | |
| if !status.is_success() { | |
| anyhow::bail!( | |
| "relaycast fallback register failed ({}): {}", | |
| status, | |
| serde_json::to_string(&body2).unwrap_or_default() | |
| ); | |
| } | |
| body2 | |
| .pointer("/data/token") | |
| .or_else(|| body2.get("token")) | |
| .and_then(Value::as_str) | |
| .map(String::from) | |
| .ok_or_else(|| anyhow::anyhow!("no token in register response"))? |
Was this helpful? React with 👍 or 👎 to provide feedback.
Add unit tests for helpers.rs (30 tests: codex model prompt, gemini action, CLI readiness, terminal query parser, throttle, strip_ansi, format_injection) and main.rs (12 bypass flag tests). Add real CLI integration tests: stress (5 tests), edge cases (7 tests), functionality (8 tests), and extended cli-spawn (4 tests). Add utility helpers: firstAvailableCli, skipUnlessAnyCli, eventsForAgent. Add head-to-head benchmark comparing broker PTY vs relay-pty across latency, throughput, reliability, multi-agent scaling, and feature matrix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| try { | ||
| await wait; | ||
| } finally { | ||
| clearTimeout(timeout); | ||
| if (this.child) { | ||
| this.child.kill("SIGKILL"); | ||
| } | ||
| } |
There was a problem hiding this comment.
🟡 SIGKILL escalation in SDK client shutdown is dead code — shutdown can hang indefinitely
In the AgentRelayClient.shutdown() method, the SIGKILL sent in the finally block at line 243-244 can never actually fire on a live process because this.child is always undefined by the time the finally block runs.
Root Cause
The exitPromise (line 302-316 in packages/sdk-ts/src/client.ts) calls this.disposeProcessHandles() which sets this.child = undefined before calling resolve(). So when await wait at line 240 completes, this.child is already undefined, making the if (this.child) check at line 243 always false.
This means if the broker process ignores SIGTERM (sent at line 235), there is no escalation to SIGKILL. The await wait at line 240 will hang forever because exitPromise never resolves, and the SIGKILL that was supposed to be the safety net can only run after the promise resolves (i.e., after the process has already exited).
Actual behavior: If the broker process hangs after receiving the shutdown request and SIGTERM, shutdown() blocks indefinitely with no SIGKILL escalation.
Expected behavior: After shutdownTimeoutMs, the process should receive SIGTERM, and if it still hasn't exited after a further grace period, it should be forcefully killed with SIGKILL.
Impact: Callers of shutdown() can hang indefinitely when the broker process is unresponsive to SIGTERM, potentially causing the parent process (SDK consumer) to also hang.
Prompt for agents
In packages/sdk-ts/src/client.ts, the shutdown() method at lines 220-247 needs to be restructured so that SIGKILL is sent as a timed escalation rather than in a finally block that only runs after the process has already exited.
The fix should:
1. After sending SIGTERM (line 235), set a second timeout (e.g. shutdownTimeoutMs * 2 or shutdownTimeoutMs + 2000) that sends SIGKILL to the captured `child` reference if the process still hasn't exited.
2. Both timeouts should be cleared in the finally block.
3. The finally block should still call child.kill('SIGKILL') as a last resort, but use the captured `child` reference (line 231) instead of `this.child` (which gets nulled by disposeProcessHandles).
Example approach:
const killTimeout = setTimeout(() => { child.kill('SIGKILL'); }, this.options.shutdownTimeoutMs + 2000);
try { await wait; } finally { clearTimeout(timeout); clearTimeout(killTimeout); }
Was this helpful? React with 👍 or 👎 to provide feedback.
…cho_in_output pty_worker.rs and wrap.rs import PendingVerification, check_echo_in_output, MAX_VERIFICATION_ATTEMPTS, and VERIFICATION_WINDOW from helpers.rs but they were either missing or gated behind #[cfg(test)]. This caused compilation failure. Also adds stderr logging to head-to-head benchmark for debugging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| chars.next(); | ||
| if nc.is_ascii_alphabetic() || nc == '@' || nc == '`' { | ||
| break; | ||
| } | ||
| } |
There was a problem hiding this comment.
🟡 strip_ansi missing ~ as CSI terminator causes character consumption past escape sequences
The strip_ansi function in helpers.rs does not recognize ~ (0x7E) as a valid CSI final byte. Per the ECMA-48 standard, CSI sequences terminate with bytes in the range 0x40–0x7E, but the code only checks for is_ascii_alphabetic() || nc == '@' || nc == '\', which misses ~, {, |, }, [, `, ], ^, and _.
Root Cause and Impact
The terminator check at src/helpers.rs:313 is:
if nc.is_ascii_alphabetic() || nc == '@' || nc == '`' {
break;
}For a ~-terminated CSI sequence like \x1b[15~ (F5 key), the parser consumes [, 1, 5, and ~ — but since ~ is not recognized as a terminator, it does not break the loop. The loop continues, consuming the next real character after the escape sequence.
For example, if PTY output contains \x1b[15~Relay message from Alice [evt_1]: hello, the function strips \x1b[15~ plus the R, producing elay message from Alice [evt_1]: hello. This corrupted output is then checked by check_echo_in_output (src/helpers.rs:144), which calls strip_ansi internally — the expected echo "Relay message from Alice [evt_1]: hello" won't match, causing a false-negative verification failure.
Impact: In practice, AI CLI output rarely contains ~-terminated CSI sequences (those are mostly input sequences), so the impact is limited to edge cases. However, if such sequences appear in PTY output, delivery verification (check_echo_in_output), CLI readiness detection (detect_cli_ready), and activity detection could all produce incorrect results.
| chars.next(); | |
| if nc.is_ascii_alphabetic() || nc == '@' || nc == '`' { | |
| break; | |
| } | |
| } | |
| while let Some(&nc) = chars.peek() { | |
| chars.next(); | |
| if nc.is_ascii_alphabetic() | |
| || nc == '@' | |
| || nc == '`' | |
| || nc == '~' | |
| || nc == '{' | |
| || nc == '|' | |
| || nc == '}' | |
| { | |
| break; | |
| } | |
| } |
Was this helpful? React with 👍 or 👎 to provide feedback.
- Gate CLI_READY_TIMEOUT and detect_cli_ready with #[cfg(test)] since they're only used in tests - Collapse identical gemini/aider/default prompt pattern branches into a single else clause (clippy::if_same_then_else) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| async shutdown(): Promise<void> { | ||
| if (this.unsubEvent) { | ||
| this.unsubEvent(); | ||
| this.unsubEvent = undefined; | ||
| } | ||
| if (this.client) { | ||
| await this.client.shutdown(); | ||
| this.client = undefined; | ||
| } | ||
| this.knownAgents.clear(); | ||
| for (const resolve of this.exitResolvers.values()) { | ||
| resolve("released"); | ||
| } | ||
| this.exitResolvers.clear(); | ||
| } |
There was a problem hiding this comment.
🟡 AgentRelay.shutdown() leaks broker child process when startPromise is in-flight
When shutdown() is called on AgentRelay while ensureStarted() is still in-flight (the startPromise exists but hasn't resolved), the method checks this.client which is still undefined, skips the shutdown, and returns. When the startPromise later resolves, it sets this.client and calls wireEvents(c), leaving a running broker child process with no cleanup.
Root Cause
The shutdown() method at packages/sdk-ts/src/relay.ts:230-244 only checks this.client but does not check or await this.startPromise:
async shutdown(): Promise<void> {
if (this.unsubEvent) { ... }
if (this.client) { // <-- undefined while startPromise is pending
await this.client.shutdown();
this.client = undefined;
}
...
}Meanwhile, ensureStarted() at line 271-288 stores the in-flight promise:
this.startPromise = AgentRelayClient.start(this.clientOptions)
.then((c) => {
this.client = c; // sets client AFTER shutdown returned
this.startPromise = undefined;
this.wireEvents(c); // subscribes to events on zombie process
return c;
});Trigger scenario: Any code path that calls shutdown() before ensureStarted() resolves — e.g., error handling during initialization, signal handlers, or concurrent operations.
Impact: The spawned agent-relay broker binary continues running as an orphan child process. The caller believes cleanup is complete, but the process persists until the parent Node.js process exits.
| async shutdown(): Promise<void> { | |
| if (this.unsubEvent) { | |
| this.unsubEvent(); | |
| this.unsubEvent = undefined; | |
| } | |
| if (this.client) { | |
| await this.client.shutdown(); | |
| this.client = undefined; | |
| } | |
| this.knownAgents.clear(); | |
| for (const resolve of this.exitResolvers.values()) { | |
| resolve("released"); | |
| } | |
| this.exitResolvers.clear(); | |
| } | |
| async shutdown(): Promise<void> { | |
| // If the client is still starting, await (or cancel) the in-flight start | |
| // before proceeding with shutdown to avoid leaking the broker process. | |
| if (this.startPromise) { | |
| try { | |
| await this.startPromise; | |
| } catch { | |
| // Start failed — nothing to shut down. | |
| } | |
| } | |
| if (this.unsubEvent) { | |
| this.unsubEvent(); | |
| this.unsubEvent = undefined; | |
| } | |
| if (this.client) { | |
| await this.client.shutdown(); | |
| this.client = undefined; | |
| } | |
| this.knownAgents.clear(); | |
| for (const resolve of this.exitResolvers.values()) { | |
| resolve("released"); | |
| } | |
| this.exitResolvers.clear(); | |
| } | |
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
Add relay-broker Rust implementation and TypeScript SDK alongside existing codebase. This is an additive-only change for incremental migration - no existing packages or code removed.
This PR extracts only the additions from #412, enabling safe merge while preserving all existing functionality.
What's Added
What's NOT Changed
Migration Strategy
Test Plan
cargo test,cargo clippy)🤖 Generated with Claude Code
Related: #412 (original migration PR - this is the safe additive subset)