feat(telemetry): deep CLI instrumentation across all commands#754
feat(telemetry): deep CLI instrumentation across all commands#754willwashburn merged 13 commits intomainfrom
Conversation
Previously only a handful of interactive commands (up/down/status/version)
emitted a minimal `cli_command_run` event. This leaves the most important
product surfaces — workflow runs, cloud auth, provider auth, swarm — entirely
invisible in telemetry.
Changes:
- packages/telemetry: expand the event schema with cli_command_complete,
workflow_run, cloud_auth, cloud_workflow_run, provider_auth, setup_init,
swarm_run, bridge_spawn, broker_start_failed. Extend cli_command_run with
flags_used (names only), is_tty, is_ci.
- src/cli/bootstrap.ts: drop the interactiveCommands gate so telemetry fires
for every command. Install global Commander preAction/postAction hooks that
emit cli_command_run + cli_command_complete with full command path (e.g.
"cloud login" not just "cloud"), duration, success, exit_code, and
error_class. Switch to parseAsync and await shutdown() to flush PostHog's
queue before exit. Add process.on('exit') + beforeExit safety nets for
commands that hard-exit mid-flight.
- Domain events at high-signal call sites: cloud login/logout/whoami/connect
(cloud_auth), cloud run (cloud_workflow_run), auth <provider> (provider_auth),
run <file> (workflow_run — all paths including yaml soft-fail and script),
init (setup_init), swarm (swarm_run), bridge (bridge_spawn).
- Privacy: flag names only (never values), no run IDs, task text, paths, or
URLs. Errors are reported as error_class (constructor name), never the
message.
Verified: tsc --noEmit clean, full vitest suite 732/732 passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| await fetch(revokeUrl, { | ||
| method: 'POST', | ||
| headers: { 'content-type': 'application/json' }, | ||
| body: JSON.stringify({ token: auth.refreshToken }), |
Events previously carried a single `agent_relay_version` property that — in
the installed layout — actually resolved to the `@agent-relay/telemetry`
package's own version, not the CLI's or SDK's. That made it impossible to
correlate behavior with specific CLI/SDK/broker releases.
Changes:
- `@agent-relay/telemetry`: `initTelemetry({ cliVersion, sdkVersion,
brokerVersion })` now takes explicit versions. `CommonProperties` gains
optional `cli_version`, `sdk_version`, `broker_version` fields.
`agent_relay_version` is kept as a back-compat alias — it mirrors whichever
component actually emitted the event (CLI or broker).
- `src/cli/bootstrap.ts`: resolves `VERSION` (CLI) and the bundled
`@agent-relay/sdk` version via `createRequire(...).resolve('@agent-relay/sdk/package.json')`,
passes both to `initTelemetry()`, and exports them on `process.env` as
`AGENT_RELAY_CLI_VERSION` / `AGENT_RELAY_SDK_VERSION` so child processes
(the Rust broker, dashboard server) inherit them automatically.
- `src/telemetry.rs` (Rust broker): attaches `broker_version` (from
`CARGO_PKG_VERSION`) plus `cli_version` / `sdk_version` read from the env
vars above, and `os_version` via best-effort `uname -r`. Keeps
`agent_relay_version` = `broker_version` for back-compat. All resolution
is infallible and the sender loop is unchanged.
Event-payload harmonization (e.g. adding `spawn_source`, `has_task`,
`is_shadow` to Rust's `AgentSpawn`) is intentionally deferred — it requires
touching multiple call sites in main.rs / wrap.rs and deserves its own PR.
Verified: `npx tsc --noEmit` clean; `npx vitest run src/cli` 190/190; `cargo
check` clean; `cargo test --lib` 228/228 including a new
`env_nonempty_handles_missing_empty_and_whitespace` test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rust broker events were missing the richer properties we added to the
TypeScript schema — most notably `spawn_source`, `has_task`, `is_shadow` on
`agent_spawn` and `release_source` on `agent_release`. Without them the only
way to distinguish "user spawned an agent from the dashboard" from "another
agent spawned this one via the broker command channel" was to infer from the
`cli` string, which loses fidelity.
Changes:
- `src/telemetry.rs`: add an `ActionSource` enum (mirrors TS `ActionSource`:
`human_cli` / `human_dashboard` / `agent` / `protocol`) that serializes to
the same snake_case strings. Extend `TelemetryEvent::AgentSpawn` with
`spawn_source` / `has_task` / `is_shadow`; extend `TelemetryEvent::AgentRelease`
with `release_source`. The existing `release_reason` string is retained for
continuity with historical events — it stays a broker-local category
(`ws_command`, `relaycast_release`), while `release_source` is the
TS-aligned product view of who initiated the release.
- `src/main.rs` — three `AgentSpawn` call sites + one `AgentRelease`:
- HTTP API dashboard path (1328): `HumanDashboard`; `has_task` from
`effective_task`; `is_shadow` from `spec_for_state.shadow_of/shadow_mode`.
- Two relaycast-driven WS paths (2249, 2430): `Protocol`; `has_task` from
`effective_task`; `is_shadow: false` (both hardcode shadow fields None).
- Relaycast release (2077): `Protocol`.
- `src/wrap.rs` — AgentSpawn from the wrap command channel (942) is always
`Agent` (another agent invoking a spawn through the broker); the release
site (979) picks `HumanCli` vs `Agent` based on the existing
`sender_is_human` detection.
- Tests: new `action_source_serializes_to_snake_case_strings`,
`agent_spawn_properties_include_new_fields`,
`agent_release_properties_include_release_source`. Existing
`event_names_are_snake_case` updated to construct the new variants. All
231 broker unit tests pass.
Verified: `cargo check`, `cargo test --lib`, `cargo fmt`, `cargo build` all
clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Expands telemetry coverage across the CLI and broker so product-critical commands (run/auth/cloud/swarm/bridge/setup) emit consistent PostHog events, with improved version/context tagging and lifecycle attribution.
Changes:
- Add global Commander hooks to emit
cli_command_run/cli_command_completefor all CLI commands, and flush telemetry on orderly CLI completion. - Introduce new high-signal domain events (
workflow_run,cloud_auth,cloud_workflow_run,provider_auth,setup_init,swarm_run,bridge_spawn) at key command call sites. - Extend broker telemetry schema with richer agent spawn/release attribution (
ActionSource,has_task,is_shadow) and common version/os tagging.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| src/wrap.rs | Add spawn/release attribution fields when broker handles agent-originated wrap actions. |
| src/telemetry.rs | Extend broker telemetry schema (ActionSource + new fields), add common version/os tagging helpers, and update serialization/tests. |
| src/main.rs | Populate new broker telemetry fields for agent spawn/release events. |
| src/cli/lib/telemetry-helpers.ts | New shared helper for privacy-safe error_class extraction. |
| src/cli/index.ts | Handle async runCli() and ensure top-level failures exit non-zero. |
| src/cli/entrypoint.test.ts | Update mocks/tests for new telemetry shutdown surface and async runCli. |
| src/cli/commands/swarm.ts | Emit swarm_run telemetry for dry-run and real broker execution. |
| src/cli/commands/setup.ts | Emit setup_init and workflow_run telemetry for init + workflow runner paths. |
| src/cli/commands/core.ts | Emit bridge_spawn telemetry when starting the multi-project bridge. |
| src/cli/commands/cloud.ts | Emit cloud_auth + cloud_workflow_run telemetry with success/duration/error_class. |
| src/cli/commands/auth.ts | Emit provider_auth telemetry with provider normalization + error_class. |
| src/cli/bootstrap.ts | Install global command run/complete hooks, propagate version env vars, switch to parseAsync, and add telemetry shutdown. |
| packages/telemetry/src/index.ts | Export new event types and InitTelemetryOptions. |
| packages/telemetry/src/events.ts | Extend event schema/types with new CLI/domain events and common version fields. |
| packages/telemetry/src/client.ts | Allow callers to pass explicit CLI/SDK/broker versions into common properties. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ctx.completed = true; | ||
| track('cli_command_complete', { | ||
| command_name: ctx.name, | ||
| success: true, | ||
| duration_ms: Date.now() - ctx.startedAt, |
| * Privacy rules for authors: | ||
| * - Never capture argument values (file paths, run IDs, task text, tokens, URLs). | ||
| * - Flag NAMES are fine; flag VALUES are not. | ||
| * - When an event can fail, prefer `error_class` (the Error constructor name) | ||
| * over `error_message` so we don't leak user content or paths. |
| emit({ success: false, errorClass: 'WorkflowNotCompleted' }); | ||
| deps.exit(1); |
| deps.exit(1); | ||
| } |
| function errorClassName(err: unknown): string { | ||
| if (err instanceof Error) return err.constructor.name; | ||
| if (err && typeof err === 'object') { | ||
| const ctor = (err as { constructor?: { name?: string } }).constructor; | ||
| return ctor?.name || 'Object'; | ||
| } | ||
| return typeof err; | ||
| } |
| if let Some(ref v) = self.os_version { | ||
| obj.insert("os_version".to_string(), json!(v)); | ||
| } |
| process.exit(0); | ||
| } |
| process.exit(exitCode); | ||
| } |
| deps.exit(1); | ||
| } |
| deps.exit(1); | ||
| } catch (err: any) { |
Addresses PR review feedback from Devin and Copilot.
Primary bug: every command that went through `deps.exit(code)` or called
`process.exit(code)` directly (swarm dry-run, swarm real, setup.ts yaml
workflow soft-failure, setup.ts yaml-with-resume soft-failure, setup.ts
catch branch, setup.ts unsupported file type) terminated Node synchronously
and lost the PostHog events we had just emitted. `shutdownTelemetry()` in
`runCli()` never got a chance to run because `process.exit()` bypasses the
pending promise, and `beforeExit` doesn't fire on hard exits.
Fix: replace raw `process.exit(code)` in command actions with a sanctioned
`CliExit` path.
- `src/cli/lib/exit.ts` (new) — exports `CliExit extends Error` and a shared
`defaultExit(code)` that throws it. All 8 command modules that had a local
`function defaultExit(code) { process.exit(code) }` now import this.
- `src/cli/bootstrap.ts` — `runCli()` catches `CliExit`, emits
`cli_command_complete` with the real `exit_code`, awaits
`shutdownTelemetry()`, then calls `process.exit(err.code)` for real. The
`postAction` hook now reads `process.exitCode ?? 0` so commands that
signal failure via `process.exitCode` (without throwing) report
`success: false` and carry the correct `exit_code`. Removed the duplicate
`errorClassName` helper — bootstrap now imports from the shared
`lib/telemetry-helpers.ts`.
- `src/cli/commands/swarm.ts` — the two `process.exit(code)` sites no longer
skip the flush. Success path (`exitCode === 0`) returns; nonzero throws
`new CliExit(exitCode)` so `runCli()` can drain PostHog's queue first.
- `src/cli/commands/{agent-management,auth,cloud,core,messaging,monitoring,
on,setup}.ts` — each replaces its local `function defaultExit` with an
import from `../lib/exit.js`. DI contract unchanged (tests still pass
their own `exit` mocks).
Schema corrections (Copilot feedback):
- `packages/telemetry/src/events.ts`:
- `ReleaseReason` is now `string` with a JSDoc listing known canonical
values (`explicit` / `crash` / `timeout` / `shutdown`) and broker-local
categories (`ws_command`, `relaycast_release`). The prior enum
contradicted what the broker actually emits.
- `CommonProperties.os_version` is now optional, matching the Rust
broker's behavior of omitting it when `uname -r` fails.
- Header docstring clarifies that `error_class` accepts both real Error
constructor names and stable synthetic category tags (e.g.
`'WorkflowNotCompleted'` from the YAML workflow non-completed path),
unified under PascalCase.
Not addressed: github-advanced-security's CodeQL finding on cloud.ts:199
(File data in outbound network request). This is pre-existing behavior —
the logout flow reads the user's own stored auth token and sends it to the
user's own API endpoint to revoke it. Not a regression from this PR.
Verified: `npx tsc --noEmit` clean; `npx vitest run src/cli` 190/190;
`cargo test --lib telemetry` 10/10.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the reviews — addressed in Resolved
Not fixed
Verified: |
Add a header banner image and center-aligned badge links to the README. Introduces readme-banner.png and updates badge markup to include npm version, CI tests, and license badges for clearer project metadata.
Review surfaced two latent bugs introduced by switching `defaultExit` from
`process.exit(code)` to `throw new CliExit(code)` in the previous commit.
1. **Async signal handlers leaked unhandled rejections** — the four
`deps.exit(0)` / `deps.exit(130)` sites in `broker-lifecycle.ts` run
inside `deps.onSignal('SIGINT'|'SIGTERM', async () => { ... })`. Node
invokes signal handlers without awaiting their returned promise, so the
thrown `CliExit` became an unhandled rejection and Node 15+ would
override the intended exit code with 1. In practice: Ctrl+C on
`agent-relay up` used to exit 0, now would exit 1 — a real regression
for scripts and CI pipelines checking exit codes.
Fix: the default `onSignal` implementation in `core.ts` now wraps the
handler invocation. If the wrapper catches a `CliExit`, it flushes
telemetry best-effort and calls the real `process.exit(err.code)`.
Anything else gets logged to stderr and exits 1. The DI contract is
unchanged — tests that supply their own `deps.onSignal` + `deps.exit`
still see their own `ExitSignal` rejection (`core.test.ts` SIGINT test
still expects and receives `{ code: 130 }`).
2. **setup.ts run-command catch swallowed CliExit** — the outer
`catch (err: any)` at the end of the `run` action now sits downstream
of three `deps.exit(1)` sites (lines ~748, 779, 796). Previously
`deps.exit` killed the process, so the catch only saw real errors. Now
`deps.exit` throws `CliExit`; the catch printed the internal message
(`"Error: cli-exit:1"`) and clobbered `error_class` on the outgoing
`workflow_run` event with `'CliExit'`.
Fix: re-throw `CliExit` at the top of the catch so it bubbles straight
through to `runCli()`.
3. Belt-and-braces for other tracked call sites:
`src/cli/lib/telemetry-helpers.ts` `errorClassName(err)` now returns
`undefined` for `CliExit` instances. The existing
`...(errorClass ? { error_class } : {})` pattern in cloud.ts / auth.ts
already handles optional values, so any future case where `CliExit`
propagates through one of those try/finally blocks won't pollute the
event with `error_class: 'CliExit'`. `bootstrap.ts` updated to use the
same spread pattern for consistency.
Verified: `npx tsc --noEmit` clean; `npx vitest run src/cli` 190/190 —
including the SIGINT force-exit regression test which directly exercises
the signal-handler semantics my wrapper preserves.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Addressed both Devin findings in
Verified: |
Updated README to improve formatting and remove redundant documentation links.
Removed the mention of plain docs for humans and agents from the README.
Documentation for `CliCommandRunEvent.is_tty` says "True when stdin is a TTY — helps tell interactive vs scripted runs apart", but the implementation was reading `process.stdout.isTTY`. That gets both common cases wrong: - `agent-relay status > file.txt` — stdout redirected to a file, stdin is still a TTY (human at the keyboard). Previously reported `is_tty: false` when the run was clearly interactive. - `echo args | agent-relay spawn ...` — stdin piped, stdout attached to a terminal. Previously reported `is_tty: true` when the run was scripted. Switched to `process.stdin.isTTY` to match the doc and the stated purpose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Devin flagged that `monitoring.ts` had the same bare
`process.on(signal, listener)` pattern that `core.ts` had before my last
fix — so `deps.exit(code)` (shared `defaultExit`, which throws `CliExit`)
from inside a monitoring command's async signal handler would leak an
unhandled rejection and Node would force exit code 1.
Rather than duplicate the wrapper inline at two sites, extracted it into a
shared helper in `src/cli/lib/exit.ts`:
export function runSignalHandler(handler): void {
// ... catch CliExit → flush → process.exit(err.code)
// ... else → console.error + process.exit(1)
}
- `core.ts` — was doing the inline wrapper; now imports and uses
`runSignalHandler`. Same behavior, half the code.
- `monitoring.ts` — was still `process.on(signal, listener)`. Switched to
the wrapper. Any future command that registers a signal handler through
`deps.onSignal` with the shared `defaultExit` now gets the right
semantics by default.
Verified: `tsc --noEmit` clean; `vitest run src/cli` 190/190 (including
the `up force exits on repeated SIGINT` regression test that directly
exercises the wrapped signal handler path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
cli_command_run+cli_command_complete(success, duration, exit code, error class) — previously only 7 interactive commands emitted any telemetry.workflow_run,cloud_auth,cloud_workflow_run,provider_auth,setup_init,swarm_run,bridge_spawn,broker_start_failed.parseAsync+shutdown()flush so PostHog's event queue drains before the process exits.Why
Today the main product surface —
agent-relay run,cloud login,auth <provider>,swarm,bridge— emits no telemetry. TheinteractiveCommandsgate in bootstrap only fires events forup/down/status/version. We cannot answer basic product questions (which commands fail? which workflows succeed? how long does onboarding take?) from the data we collect.What's in the new schema
cli_command_run(extended)command_name,flags_used,is_tty,is_cicli_command_complete(new)success,duration_ms,exit_code,error_classworkflow_run(new)agent-relay run <file>file_type,is_script,is_dry_run,is_resume,success,duration_mscloud_auth(new)cloud login/logout/whoami/connectaction,success,duration_ms,provider?cloud_workflow_run(new)cloud runhas_explicit_file_type,sync_code,json_output,success,duration_msprovider_auth(new)auth <provider>provider,use_auth_broker,used_token,success,duration_mssetup_init(new)init/setupwizardis_cloud,broker_was_running,user_started_broker,yes_flag,skip_brokerswarm_run(new)swarmpattern,teams,cli,is_list,is_dry_run,exit_code,duration_msbridge_spawn(new)bridgeproject_count,cli,has_architectbroker_start_failed(new)stage,error_classPrivacy
error_class(the Error constructor name) — never the message.AGENT_RELAY_TELEMETRY_DISABLED=1andagent-relay telemetry disable) continues to work; thetelemetrymanagement commands themselves are excluded frominitTelemetryso opt-in/opt-out flips never leak events.Key mechanics
src/cli/bootstrap.tsinstallspreAction/postActionhooks on the root Commander program. Full command path ("cloud login"not"cloud") is derived by walking parent chain; flag names come fromgetOptionValueSource(...) === 'cli'.runCliis nowasyncandawaits bothparseAsyncandshutdownTelemetry(). Top-level try/catch captures thrown errors and emits a failurecli_command_completebefore flushing.process.on('exit')+beforeExitare safety nets for commands that callprocess.exit(code)mid-flight, so acli_command_completestill fires with the exit code.src/cli/index.tshandles the promise returned byrunCliwith.catch()and a non-zero exit on unhandled top-level failures.Test plan
npx tsc --noEmit— cleannpx vitest run— 732/732 passing, including updatedentrypoint.test.tsmocks for the new telemetry import surface and asyncrunCliagent-relay status,agent-relay run some.yaml,agent-relay cloud whoamiwith telemetry enabled against a PostHog dev project and verify the new events arrive with expected propertiesAGENT_RELAY_TELEMETRY_DISABLED=1 agent-relay ...still produces no network trafficFiles
packages/telemetry/src/events.ts,index.tssrc/cli/bootstrap.ts,index.ts,entrypoint.test.tssrc/cli/commands/{cloud,auth,setup,swarm,core}.tssrc/cli/lib/telemetry-helpers.ts(new sharederrorClassName)🤖 Generated with Claude Code