feat(broker): snapshot module + dump-pty debug command (#869)#870
Conversation
Adds `src/snapshot.rs` with `Snapshot::capture` / `to_plain` / `to_ansi`,
exposes it over `GET /api/spawned/{name}/snapshot?format=plain|ansi`, and
ships an `agent-relay-broker dump-pty <name>` admin CLI that wraps the
route. Renderers walk the existing alacritty VT grid (#836) once under
the term lock, then drop it — captured state is self-contained so the PTY
reader thread keeps advancing while callers render.
The HTTP request flows broker → PTY worker subprocess via a new
`snapshot_pty` / `snapshot_response` frame pair keyed by request_id, with
a 5s timeout sweep in the broker's reap_tick so a crashed worker can't
hang the HTTP oneshot. The `ansi` payload is base64 because the bytes
contain control characters; `dump-pty --format ansi` decodes and writes
raw bytes to stdout for terminal re-render.
12 new tests (8 snapshot unit + 4 HTTP route), 600 total pass (was 588).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughImplements snapshot capture and dual rendering (plain text and ANSI) of PTY session screens via a new ChangesSnapshot Capture, Rendering, and API Integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.trajectories/completed/2026-05/traj_cbmwd07phhm2.json:
- Line 75: The committed trajectory metadata contains a machine-local absolute
path in the JSON key "projectId", which leaks local environment details; update
the value stored under "projectId" to a portable identifier (e.g., repository
name, git remote/commit hash, or a relative path/slug) and remove any /Users/…
absolute paths, and change the code that generates this metadata to emit the
portable identifier instead of the local absolute path (ensure any
serializer/metadata writer that sets "projectId" is updated and re-run to
replace the current value).
In `@src/main.rs`:
- Around line 867-870: The code currently sets search_roots to probe both
cwd.join(".agent-relay") and cwd.clone(), which allows autodiscovery to read
<cwd>/connection.json; update the match that builds search_roots (the variable
named search_roots in main.rs, using state_dir and cwd) to only include the
dedicated broker metadata directory (cwd.join(".agent-relay")) when state_dir is
None so we no longer probe the repository root for connection.json and therefore
won't accidentally read unrelated config files or leak broker credentials.
- Around line 2164-2193: The Snapshot handler currently sends snapshot_pty to
all workers but run_headless_worker does not handle that RPC, so requests to
headless workers time out; before calling workers.send_to_worker in
ListenApiRequest::Snapshot check the worker's capabilities (e.g. via an existing
method like workers.has_worker + a capability/query method such as
workers.is_headless / workers.supports_pty / workers.get_worker_kind) and
immediately reply with an error (e.g. "unsupported_runtime: snapshot_pty not
supported for headless worker") when the target is a headless/non-PTY worker
instead of forwarding the request and inserting into pending_snapshots; keep the
current error path for missing workers and only call workers.send_to_worker when
the worker actually supports PTY snapshots.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: a4fb9eca-b1f1-4c56-9e5e-37f3343f8acd
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (13)
.trajectories/completed/2026-05/traj_cbmwd07phhm2.json.trajectories/completed/2026-05/traj_cbmwd07phhm2.md.trajectories/index.jsonCargo.tomldocs/reference-broker-api.mdsrc/lib.rssrc/listen_api.rssrc/main.rssrc/pty.rssrc/pty_worker.rssrc/snapshot.rsweb/content/docs/reference-broker-api.mdxweb/lib/docs-nav.ts
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0b0518a2da
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return Ok(BrokerConnection { | ||
| base_url: url.trim_end_matches('/').to_string(), | ||
| api_key: explicit_api_key | ||
| .map(ToString::to_string) | ||
| .or_else(|| std::env::var("RELAY_BROKER_API_KEY").ok()) | ||
| .filter(|value| !value.trim().is_empty()), |
There was a problem hiding this comment.
Preserve API key fallback when broker URL is overridden
When --broker-url is provided, this early return skips reading connection.json, so dump-pty can only authenticate via --api-key or RELAY_BROKER_API_KEY. In the common case where the broker key is only in .agent-relay/connection.json, dump-pty --broker-url ... sends an unauthenticated request and fails with 401. Keep the URL override, but still allow API-key fallback from connection.json to match the documented fallback behavior.
Useful? React with 👍 / 👎.
| let _ = reply.send(Err( | ||
| "snapshot_timeout: worker did not respond in time".into(), | ||
| )); |
There was a problem hiding this comment.
Return server error for timed-out snapshot requests
Timed-out snapshot requests are tagged as snapshot_timeout, but listen_api_snapshot classifies unknown prefixes as a client error (400), so worker crashes/timeouts are surfaced as bad requests instead of server-side failures (e.g., 5xx/504). This makes callers mis-handle transient broker/worker faults as user input problems; emit an error code that maps to server failure or extend classification for snapshot_timeout.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
9 issues found across 14 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".trajectories/completed/2026-05/traj_cbmwd07phhm2.json">
<violation number="1" location=".trajectories/completed/2026-05/traj_cbmwd07phhm2.json:75">
P2: The `projectId` field leaks a developer's local filesystem path (including username) into the repository. Consider using a relative path or an opaque project identifier instead.</violation>
</file>
<file name=".trajectories/index.json">
<violation number="1" location=".trajectories/index.json:586">
P2: Path points into a temporary `.claude/worktrees/` directory that won't exist after the worktree is cleaned up. Use a relative path (like some other entries) or an absolute path to the main project's `.trajectories/` directory so the trajectory file remains accessible.</violation>
</file>
<file name="src/main.rs">
<violation number="1" location="src/main.rs:844">
P2: When `--broker-url` is provided, this early return skips reading `connection.json`, so the API key stored there is never loaded. In the common case where the broker key lives only in `.agent-relay/connection.json`, `dump-pty --broker-url http://...` sends an unauthenticated request and gets a 401. Consider still reading `connection.json` for the API key even when the URL is overridden.</violation>
<violation number="2" location="src/main.rs:870">
P2: Including `cwd.clone()` in `search_roots` means any file named `connection.json` in the current working directory — even one unrelated to the broker — can be picked up as broker metadata. This could send the snapshot request (and the broker API key) to a wrong URL. Restrict autodiscovery to `cwd.join(".agent-relay")` only.</violation>
<violation number="3" location="src/main.rs:1361">
P2: Comment claims snapshot entries are cleaned up in the worker-exit handler, but no such cleanup exists there. The timeout sweep in `reap_tick` is the only mechanism that removes stale entries. Either add explicit cleanup in the worker-exit path (which would give faster feedback to the HTTP caller) or correct the comment to say entries are cleaned by the timeout sweep.</violation>
<violation number="4" location="src/main.rs:2170">
P2: Snapshot requests for headless (non-PTY) workers are not rejected up front. Since `run_headless_worker` doesn't handle `snapshot_pty` frames, it returns `worker_error` which is never mapped back to `pending_snapshots`. The request silently waits ~5 seconds until the timeout sweep fires, then returns a misleading `snapshot_timeout` error. Check the worker's runtime type here and return an immediate descriptive error for non-PTY workers.</violation>
<violation number="5" location="src/main.rs:3510">
P2: Timed-out snapshot requests emit `"snapshot_timeout: ..."` which `classify_error` doesn't recognize, so it falls through to a 400 Bad Request. A server-side timeout (worker died or is unresponsive) should map to a 5xx status (e.g. 504 Gateway Timeout), not a client error. Either add `"snapshot_timeout"` to the `classify_error` mapping or use an existing prefix that maps to a server error.</violation>
</file>
<file name="src/snapshot.rs">
<violation number="1" location="src/snapshot.rs:175">
P2: Wide character spacer cells are not skipped, causing misalignment when the ANSI output is replayed. When a wide char (CJK, emoji) is present, alacritty stores a spacer cell (space + `WIDE_CHAR_SPACER` flag) after it. Outputting that space as a real character will shift all subsequent content by one column. The spacer should be skipped in the render loop (similar to how alacritty's own `line_content` method works).</violation>
<violation number="2" location="src/snapshot.rs:352">
P2: `named_color_sgr` always returns `Some(...)` — the `Option` return type and the caller's `None` handling are dead code. For semantic defaults (Foreground, Background, Cursor, etc.) return `None` so the caller skips emission, matching the documented intent and avoiding redundant `;39`/`;49` after a reset.</violation>
</file>
Tip: instead of fixing issues one by one fix them all with cubic
Re-trigger cubic
| request_id = %req_id, | ||
| "snapshot request timed out before worker responded" | ||
| ); | ||
| let _ = reply.send(Err( |
There was a problem hiding this comment.
P2: Timed-out snapshot requests emit "snapshot_timeout: ..." which classify_error doesn't recognize, so it falls through to a 400 Bad Request. A server-side timeout (worker died or is unresponsive) should map to a 5xx status (e.g. 504 Gateway Timeout), not a client error. Either add "snapshot_timeout" to the classify_error mapping or use an existing prefix that maps to a server error.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/main.rs, line 3510:
<comment>Timed-out snapshot requests emit `"snapshot_timeout: ..."` which `classify_error` doesn't recognize, so it falls through to a 400 Bad Request. A server-side timeout (worker died or is unresponsive) should map to a 5xx status (e.g. 504 Gateway Timeout), not a client error. Either add `"snapshot_timeout"` to the `classify_error` mapping or use an existing prefix that maps to a server error.</comment>
<file context>
@@ -3197,6 +3484,35 @@ async fn run_init(cmd: InitCommand, telemetry: TelemetryClient) -> Result<()> {
+ request_id = %req_id,
+ "snapshot request timed out before worker responded"
+ );
+ let _ = reply.send(Err(
+ "snapshot_timeout: worker did not respond in time".into(),
+ ));
</file context>
- Reject `snapshot_pty` for headless workers up front instead of waiting on the 5 s timeout sweep and returning a misleading `snapshot_timeout`. Maps to HTTP 409 via a new `unsupported_runtime` error class. - Map `snapshot_timeout` to 504 Gateway Timeout in `classify_error` instead of falling through to 400 Bad Request — worker stalls are a server-side fault, not bad input. - Restrict `dump-pty` broker-connection autodiscovery to `<cwd>/.agent-relay/connection.json`. Dropping the bare-cwd probe prevents an unrelated `connection.json` in a repo root from silently redirecting the request and leaking the broker API key. - Still pull the API key out of `connection.json` when only `--broker-url` / `RELAY_BROKER_URL` is supplied, so users don't have to repeat `--api-key` on every dump-pty invocation. - Fix `snapshot::to_ansi` to skip `WIDE_CHAR_SPACER` cells; emitting the placeholder space pushed every subsequent column one to the right on replay (CJK / emoji round-trip). - Make `named_color_sgr` actually return `None` for semantic defaults (Foreground / Background / Cursor / Bright|DimForeground) — the surrounding reset already covers them, and avoiding the redundant `;39` / `;49` shrinks the byte stream and removes dead `None`-handling. - Fix the misleading comment claiming the worker-exit handler reaps `pending_snapshots` (the deadline sweep in `reap_tick` is the only cleanup; reference issue #871 for the planned generalisation). - Strip local absolute paths (`/Users/will/.../worktrees/…`) leaked into the new trajectory JSON. - Add route tests for the 409 / 504 mappings and a wide-char round-trip + `named_color_sgr` defaults test for snapshot. `cargo fmt` / `cargo clippy --all-targets --all-features -- -D warnings` / `cargo test` all clean; broker suite now 604 passed (was 600). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Pushed
Docs ( CI
|
|
To use Codex here, create an environment for this repo. |
|
The Preview Web (SST) CloudFront cap is clearly an infra-level issue unrelated to this PR. Everything else is green. 🐇 ✨ ✅ Actions performedComments resolved. Approval is disabled; enable |
There was a problem hiding this comment.
1 issue found across 7 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/main.rs">
<violation number="1" location="src/main.rs:854">
P2: The `current_dir().ok()?` early-returns `None` from the closure even when `state_dir` is `Some(...)` and `cwd` isn't needed. Move the `current_dir()` call into the `None` arm so that an explicit `state_dir` isn't silently ignored when cwd is unavailable.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Fix all with cubic | Re-trigger cubic
| let cwd = std::env::current_dir().ok()?; | ||
| let roots: Vec<PathBuf> = match state_dir { | ||
| Some(dir) => vec![dir.to_path_buf()], | ||
| None => vec![cwd.join(".agent-relay")], | ||
| }; |
There was a problem hiding this comment.
P2: The current_dir().ok()? early-returns None from the closure even when state_dir is Some(...) and cwd isn't needed. Move the current_dir() call into the None arm so that an explicit state_dir isn't silently ignored when cwd is unavailable.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/main.rs, line 854:
<comment>The `current_dir().ok()?` early-returns `None` from the closure even when `state_dir` is `Some(...)` and `cwd` isn't needed. Move the `current_dir()` call into the `None` arm so that an explicit `state_dir` isn't silently ignored when cwd is unavailable.</comment>
<file context>
@@ -832,22 +832,58 @@ struct BrokerConnection {
+ // pull out a stored `api_key`. Lets `--broker-url` reuse the broker's
+ // saved key when the env var and `--api-key` are both unset.
+ let api_key_from_connection_file = || -> Option<String> {
+ let cwd = std::env::current_dir().ok()?;
+ let roots: Vec<PathBuf> = match state_dir {
+ Some(dir) => vec![dir.to_path_buf()],
</file context>
| let cwd = std::env::current_dir().ok()?; | |
| let roots: Vec<PathBuf> = match state_dir { | |
| Some(dir) => vec![dir.to_path_buf()], | |
| None => vec![cwd.join(".agent-relay")], | |
| }; | |
| let roots: Vec<PathBuf> = match state_dir { | |
| Some(dir) => vec![dir.to_path_buf()], | |
| None => { | |
| let cwd = std::env::current_dir().ok()?; | |
| vec![cwd.join(".agent-relay")] | |
| } | |
| }; |
`/docs/*.md` is a legacy mirror of the public site that's gone stale — only 5 pages survive there vs 40+ in `web/content/docs/*.mdx`. New pages don't belong in it. The `.claude/rules/docs-sync.md` rule that asked for the mirror is being updated separately. This drops the `docs/reference-broker-api.md` file PR #870 added; the canonical reference at `web/content/docs/reference-broker-api.mdx` is untouched and already covers the snapshot endpoint. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version of `web/content/docs/reference-broker-api.mdx` on this branch only contained the new Snapshots section — it assumed the full broker API reference (which exists on PR #865's branch as 247 lines covering auth, routes, /ws, error envelope) was already on main. It isn't. If this PR merges first, the published docs page would have only a header + the snapshot section, missing the rest of the API surface. Folds PR #865's full reference content in and slots the snapshot section under "Routes" between "PTY interaction" and "Event stream" (promoted from `##` to `###` to match the section's nesting). Adds a snapshot step to the end-to-end worked example so the new endpoint is exercised there too. Also brings the `.claude/rules/docs-sync.md` rewrite from PR #865 onto this branch — the old "mirror to /docs/*.md" rule was what sent the first version of this doc down the wrong path. Either merge order now works: if PR #865 lands first there's a trivial conflict on the duplicated base content; if this PR lands first the doc is already complete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@web/content/docs/reference-broker-api.mdx`:
- Line 230: Update the docs for the `--state-dir` flag: remove the “(or the
current directory)” phrase and clarify that broker discovery looks for
`.agent-relay/connection.json` under the working directory by default unless an
explicit `--state-dir` is provided; update the table entry that currently reads
"`--state-dir` | Directory containing `connection.json` when discovering the
broker. Defaults to `.agent-relay/` (or the current directory).`" to instead
state the default as `.agent-relay/connection.json` under the working directory
and mention that providing `--state-dir` overrides this behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 7eb9ab3e-8387-48f1-af62-0076d9120bb2
📒 Files selected for processing (2)
.claude/rules/docs-sync.mdweb/content/docs/reference-broker-api.mdx
| | `--format` | `plain` (default) or `ansi`. `plain` writes the rendered screen as UTF-8; `ansi` writes the decoded reproduction bytes. | | ||
| | `--broker-url` | Override the broker base URL. Falls back to `RELAY_BROKER_URL`, then to `.agent-relay/connection.json` in the current directory. | | ||
| | `--api-key` | Override the API key. Falls back to `RELAY_BROKER_API_KEY`, then to the value in `.agent-relay/connection.json`. | | ||
| | `--state-dir` | Directory containing `connection.json` when discovering the broker. Defaults to `.agent-relay/` (or the current directory). | |
There was a problem hiding this comment.
Clarify --state-dir default to avoid implying bare-CWD probing.
Line 230 currently says the default is .agent-relay/ “(or the current directory)”, which conflicts with the tightened autodiscovery behavior. Please remove the “or the current directory” phrasing and state that discovery uses .agent-relay/connection.json under the working directory unless --state-dir is explicitly provided.
Suggested doc tweak
-| `--state-dir` | Directory containing `connection.json` when discovering the broker. Defaults to `.agent-relay/` (or the current directory). |
+| `--state-dir` | Directory containing `connection.json` when discovering the broker. Defaults to `.agent-relay/` under the current working directory. |📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| | `--state-dir` | Directory containing `connection.json` when discovering the broker. Defaults to `.agent-relay/` (or the current directory). | | |
| | `--state-dir` | Directory containing `connection.json` when discovering the broker. Defaults to `.agent-relay/` under the current working directory. | |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@web/content/docs/reference-broker-api.mdx` at line 230, Update the docs for
the `--state-dir` flag: remove the “(or the current directory)” phrase and
clarify that broker discovery looks for `.agent-relay/connection.json` under the
working directory by default unless an explicit `--state-dir` is provided;
update the table entry that currently reads "`--state-dir` | Directory
containing `connection.json` when discovering the broker. Defaults to
`.agent-relay/` (or the current directory).`" to instead state the default as
`.agent-relay/connection.json` under the working directory and mention that
providing `--state-dir` overrides this behavior.
…p-pty # Conflicts: # web/content/docs/reference-broker-api.mdx # web/lib/docs-nav.ts
Summary
Adds a self-contained PTY screen snapshot for running workers and exposes
it three ways:
Snapshot::capture(&pty)— copies the alacritty VT grid out from underthe term lock, then renders to either plain UTF-8 (
to_plain) or anANSI reproduction byte stream (
to_ansi). The render methods runagainst owned data so they cannot block the PTY reader thread.
GET /api/spawned/{name}/snapshot?format=plain|ansi— the brokerproxies the request to the worker subprocess via a new
snapshot_pty→
snapshot_responseframe pair, correlates the response byrequest_id, and returns{ rows, cols, cursor: [r, c], screen }tothe caller. ANSI bytes are base64-encoded because they contain control
characters. A 5s timeout sweep in the broker's existing
reap_tickprotects against a worker dying mid-request.
agent-relay-broker dump-pty <name> [--format plain|ansi]— admin CLIthat discovers the broker from
.agent-relay/connection.json(or--broker-url/RELAY_BROKER_URL) and prints the rendered screen.Resolves #869.
Test plan
cargo fmt --checkcleancargo clippy --all-targetsclean (no new warnings)cargo test— 600 pass (was 588), 0 failuresSGR, ANSI round-trip through a fresh
Term, empty grid, livePtySession::capture)invalid-format 400 without broker call, agent_not_found → 404)
dump-pty --helprenders correctlyNotes / open questions
to_ansiv1 supports foreground / background named + 256-palette +truecolor + bold + reverse + underline. Italic, strikeout, dim,
hyperlinks, undercurl colour, and zerowidth chars are deliberately
dropped — round-trip fidelity is preserved for the common case (the
test asserts a fresh
Termends up with an identical grid).issue's test plan is punted: the broker HTTP server isn't easy to boot
in-process (it owns the broker loop), and the existing unit + route
test surface already covers each end of the JSON-frame contract.
🤖 Generated with Claude Code