
chore(cloud-agent): track stream disconnects and report agent-server crashes#2461

Merged
tatoalo merged 2 commits into main from chore/cloud-tasks-observability
Jun 2, 2026

@tatoalo tatoalo commented Jun 2, 2026

Problem

Two kinds of cloud-run failure were invisible from both the desktop app and analytics:

  1. The cloud-run SSE watcher gave up reconnecting.
  2. The agent server crashed hard (uncaught exception / OOM / unhandled rejection).

Changes

  • Added analytics and crash reporting so these failures become visible.

…crashes

Make two previously-invisible cloud-run failures observable, without
changing any UX.

- Desktop client: emit a "Cloud stream disconnected" PostHog event whenever
  a cloud-run watcher gives up (failWatcher). Carries the error title and the
  three reconnect-budget counts, so an idle Envoy cut can be told apart from a
  real outage, and the event can be joined to run outcomes to see whether the
  underlying run survived.
- Agent server: install uncaughtException / unhandledRejection handlers that
  mark the run failed (reportFatalError) before exiting. A hard crash was
  silent — the run stayed non-terminal and the desktop showed only a generic
  disconnect until the multi-hour inactivity timeout.
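
A minimal sketch of how the agent-server handlers described above might be wired, assuming a `reportFatalError` that takes the error and returns a promise. Only `uncaughtException`, `unhandledRejection`, and the name `reportFatalError` come from this PR; `makeFatalHandler`, `installFatalHandlers`, and the injectable `exit` parameter are hypothetical names used for illustration, and the real code in `bin.ts` may differ.

```typescript
// Hypothetical signature: the PR only names reportFatalError, not its arguments.
type FatalReporter = (error: Error) => Promise<void>;

// Builds one handler that marks the run failed, then exits with code 1.
// `exit` is injectable (an assumption of this sketch) so the logic can be
// exercised without terminating the process.
function makeFatalHandler(
  reportFatalError: FatalReporter,
  exit: (code: number) => void = (code) => process.exit(code),
): (reason: unknown) => Promise<void> {
  let handling = false;
  return async (reason: unknown): Promise<void> => {
    if (handling) return; // ignore a second crash while the first is being reported
    handling = true;
    // unhandledRejection can deliver any value, so normalise it to an Error.
    const error = reason instanceof Error ? reason : new Error(String(reason));
    try {
      await reportFatalError(error);
    } catch {
      // Reporting is best effort; the process still has to exit.
    } finally {
      exit(1);
    }
  };
}

// Registers the same handler for both crash paths.
function installFatalHandlers(reportFatalError: FatalReporter): void {
  const handler = makeFatalHandler(reportFatalError);
  process.on("uncaughtException", (err) => void handler(err));
  process.on("unhandledRejection", (reason) => void handler(reason));
}
```

Sharing one handler keeps the two crash paths consistent, and the `finally` means a failing report cannot prevent the exit. It does not protect against a report that never settles.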
@tatoalo tatoalo self-assigned this Jun 2, 2026
@tatoalo tatoalo marked this pull request as ready for review June 2, 2026 11:29
@tatoalo tatoalo marked this pull request as draft June 2, 2026 11:33

greptile-apps Bot commented Jun 2, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
packages/agent/src/server/bin.ts:193-201
**Crash handler can hang indefinitely before exiting**

`reportFatalError` makes two unbounded network calls (`updateTaskRun` and `eventStreamSender.stop()`). If the PostHog API is slow or unreachable at crash time (e.g., network partition during a container restart), neither call has a timeout, so `process.exit(1)` in the `finally` block would never be reached. This leaves the container alive and blocking pod shutdown. Consider wrapping the body in a `Promise.race` with a short deadline (e.g., 5 s) so the process always exits promptly after a fatal error.

Reviews (1): Last reviewed commit: "chore(cloud-agent): track stream disconn..."

Comment thread packages/agent/src/server/bin.ts
reportFatalError makes two unbounded network calls; if the API is slow or
unreachable at crash time, process.exit in the handler's finally would never
run and the container would block pod shutdown. Race it against a 5s deadline
so the process always exits promptly after a fatal error.
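
The deadline described above could be sketched as a small `Promise.race` helper. This is an illustration under assumptions, not the code merged in `bin.ts`: `withDeadline` and `onFatal` are hypothetical names, and `report` stands in for a call to `reportFatalError`.

```typescript
// Resolves to "done" when `work` settles (fulfilled or rejected) before `ms`
// milliseconds elapse, otherwise to "timeout". It never rejects.
function withDeadline(
  work: Promise<unknown>,
  ms: number,
): Promise<"done" | "timeout"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), ms);
  });
  const settled = work.then(
    () => "done" as const,
    () => "done" as const, // a failed report still counts as finished
  );
  return Promise.race([settled, deadline]).finally(() => {
    // Clear the timer so it cannot keep the event loop alive after an early finish.
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Hypothetical crash path: wait at most 5 s for the report, then always exit.
async function onFatal(
  report: () => Promise<void>,
  exit: (code: number) => void,
): Promise<void> {
  try {
    await withDeadline(report(), 5_000);
  } finally {
    exit(1);
  }
}
```

Because `withDeadline` never rejects and settles within `ms` milliseconds, the `finally` in `onFatal` is reached even when the network call hangs.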
@tatoalo tatoalo marked this pull request as ready for review June 2, 2026 11:36
@tatoalo tatoalo requested a review from a team June 2, 2026 11:37

greptile-apps Bot commented Jun 2, 2026

Reviews (2): Last reviewed commit: "chore(cloud-agent): bound fatal-error re..."

@tatoalo tatoalo merged commit cc15198 into main Jun 2, 2026
21 checks passed
@tatoalo tatoalo deleted the chore/cloud-tasks-observability branch June 2, 2026 11:45