
Spec 027 — Agent telemetry ingestion + reference agent#81

Merged: Copxer merged 3 commits into main from spec/027-agent-telemetry-ingestion on May 2, 2026

@Copxer Copxer commented May 2, 2026

Closes #80

Spec: `specs/phase-6-docker-hosts/027-agent-telemetry-ingestion.md`

Summary

  • POST /agent/telemetry wired with a single `AuthenticateAgent` middleware that does both bearer-token auth and per-token rate limiting (60 req/min/token). Hashing matches spec 026's `AgentToken::hash()`. 401 on missing / malformed / revoked / archived-host. 429 with `Retry-After` on overflow.
  • Payload pipeline: `IngestTelemetryRequest` validates host metrics + facts + container array, and requires `recorded_at` to fall inside a skew window of at most 1h in the past and 5min in the future. `IngestHostTelemetryAction` (transactional) updates host facts, flips `pending → online` (skipping no-op writes when already Online), inserts one host snapshot, then dispatches `SyncContainerSnapshotsAction` to upsert each container on `(host_id, container_id)` and append a per-container snapshot.
  • Reference Node agent at `agent/reference-agent.mjs` (Node 20+, no deps, single file) plus `agent/README.md` documenting env vars, exit codes, payload contract, and a sample systemd unit. With `NEXUS_URL` and `NEXUS_AGENT_TOKEN` pointed at a local `composer run dev` instance, the agent produces telemetry rows within one interval.
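
For readers integrating their own agent, here is a rough sketch of how a client might react to the status codes listed above. It is not taken from `agent/reference-agent.mjs`; the function names (`nextStep`, `sendOnce`), the return shape, and the choice to stop on 401 are assumptions made for illustration. Only the endpoint path, the bearer header, and the 204 / 401 / 422 / 429 codes come from this PR.

```javascript
// Hypothetical helper: decide what an agent does after one POST /agent/telemetry.
// The { action, delayMs } shape is an assumption, not the reference agent's API.
function nextStep(status, retryAfterHeader, intervalMs) {
  if (status === 204) return { action: "continue", delayMs: intervalMs };
  // 401: token missing, malformed, revoked, or host archived. Retrying will not help.
  if (status === 401) return { action: "stop", delayMs: 0 };
  if (status === 429) {
    // Only the delta-seconds form of Retry-After is handled in this sketch.
    const seconds = Number.parseInt(retryAfterHeader ?? "", 10);
    const delayMs =
      Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : intervalMs;
    return { action: "retry", delayMs };
  }
  // 422: payload rejected by validation; drop this sample and carry on.
  if (status === 422) return { action: "drop", delayMs: intervalMs };
  return { action: "retry", delayMs: intervalMs };
}

// Hypothetical sender using Node 20's global fetch. The payload contents are
// whatever agent/README.md specifies; they are deliberately not spelled out here.
async function sendOnce(baseUrl, token, payload, intervalMs) {
  const res = await fetch(new URL("/agent/telemetry", baseUrl), {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  });
  return nextStep(res.status, res.headers.get("Retry-After"), intervalMs);
}
```

Keeping the status handling in a pure function separate from the network call makes the back-off behaviour easy to unit-test without a running server.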

Test plan

  • `POST /agent/telemetry` with a valid bearer + payload returns 204 and persists host metadata + 1 host snapshot + N container snapshots.
  • Endpoint rejects: missing bearer / malformed bearer / revoked token / token belonging to an archived host / unknown token (all 401).
  • Token's `last_used_at` stamped on success only — a 429-throttled or 401-rejected request does not advance it.
  • First successful telemetry from a `pending` host flips status to `online` and stamps `last_seen_at`.
  • Per-token rate limit: 61st request inside 60 s returns 429 with `Retry-After`. Each token gets its own bucket.
  • Payload outside the skew window (≤1 h past / +5 min future), or with non-string `recorded_at` (`0`, `false`, `[]`), rejected with 422.
  • Reference agent script + README live under `agent/`.
  • `vendor/bin/pint --dirty` clean.
  • `php artisan test`: 460 passing (+26 vs main; 22 first-pass tests + 4 more from review).
  • `npm run build` succeeds.
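
The skew-window rule in the test plan can be illustrated with a small standalone check. This is a JavaScript sketch of the rule only, not the Laravel validation in `IngestTelemetryRequest`; the function name, the inclusive boundaries, and the use of `Date.parse` are assumptions.

```javascript
const MAX_PAST_MS = 60 * 60 * 1000; // 1 h in the past
const MAX_FUTURE_MS = 5 * 60 * 1000; // 5 min in the future

// Hypothetical check: true when recordedAt is a parseable timestamp string
// inside the window around nowMs. Boundaries are treated as inclusive here.
function isAcceptableRecordedAt(recordedAt, nowMs = Date.now()) {
  // Reject 0, false, [] and anything else that is not a string up front,
  // so a loosely typed value can never reach the date comparison.
  if (typeof recordedAt !== "string") return false;
  const ts = Date.parse(recordedAt);
  if (Number.isNaN(ts)) return false;
  const delta = ts - nowMs;
  return delta >= -MAX_PAST_MS && delta <= MAX_FUTURE_MS;
}
```

A server applying this rule would answer 422 whenever the function returns `false`.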

Self-review notes

Self-review pass via `superpowers:code-reviewer` flagged 5 should-fix items + several nice-to-haves. Addressed in commit 4536780:

  1. `StartSession` was running on every agent POST. Agents are non-browser JSON clients on a 30s heartbeat — at 50 hosts that's ~144k orphan session rows per day, plus a `Set-Cookie` on every 204. Route now uses `withoutMiddleware([EncryptCookies, AddQueuedCookiesToResponse, StartSession, ShareErrorsFromSession, HandleInertiaRequests, AddLinkHeadersForPreloadedAssets, PreventRequestForgery])`. (`PreventRequestForgery` had to go too — it refreshes the XSRF cookie at response time even when the path is in the except list.) Locked in by a feature test asserting no cookies on the response.
  2. Skip redundant `status` writes. `IngestHostTelemetryAction` was issuing `UPDATE hosts SET status='online'` on every healthy heartbeat. Now only writes when the status would actually change. Plus a test asserting the no-op path is taken.
  3. `recorded_at` integer / boolean / array values. Could squeak past `date` validation in some Laravel versions and bypass the skew check. Added `test_validation_rejects_non_string_recorded_at`.
  4. Host isolation for container_id collision. Two hosts running a container with the same id should each get their own row. Added `test_same_container_id_on_two_hosts_creates_two_distinct_rows`.
  5. Spec drift. Plan §3+§4 still described the abandoned `throttle:agent-telemetry` approach. Work log now records both implementation deviations (rate-limit-in-middleware, web-stack-strip) with rationale; acceptance criteria checked off; status flipped to `done`.
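
Item 4's host isolation follows from keying containers on the pair `(host_id, container_id)`. The in-memory sketch below shows the intended behaviour under that assumption; `ContainerStore` and the entry fields (`containerId`, `name`, `cpuPercent`) are hypothetical stand-ins, not the schema or the `SyncContainerSnapshotsAction` code.

```javascript
// Hypothetical in-memory model of "upsert on (host_id, container_id), then
// append one snapshot per entry". Field names are illustrative only.
class ContainerStore {
  constructor() {
    this.containers = new Map(); // composite key -> container row
    this.snapshots = []; // append-only list of per-container snapshots
  }

  sync(hostId, entries) {
    for (const entry of entries) {
      // Both parts go into the key, so equal container ids on different
      // hosts never share a row.
      const key = JSON.stringify([hostId, entry.containerId]);
      const existing = this.containers.get(key);
      this.containers.set(key, {
        ...existing,
        hostId,
        containerId: entry.containerId,
        name: entry.name,
      });
      this.snapshots.push({
        hostId,
        containerId: entry.containerId,
        cpuPercent: entry.cpuPercent,
      });
    }
  }
}
```

Syncing the same container id from two hosts yields two container rows, while a repeat sync from one host updates its existing row and only adds a snapshot.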

Implementation deviations from the spec (worth knowing)

  • Rate limiting moved into `AuthenticateAgent` middleware rather than a Laravel `throttle:` named limiter. Tried the named-limiter approach first; it failed because Laravel's default `MiddlewarePriority` runs `ThrottleRequests` before any unlisted custom middleware, so the named-limiter callback always saw `agent_token = null`. Options were (a) inject `AuthenticateAgent` into the priority list (brittle), or (b) collapse auth + throttle into one middleware. Went with (b); the rate limit now fires only after we've identified the token, which is what we wanted anyway.
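
To make the ordering in option (b) concrete, here is a minimal sketch of auth and throttling collapsed into one step. It is written in JavaScript for illustration and is not the PHP `AuthenticateAgent` middleware. The fixed-window counter, the `lookupToken` callback (which stands in for the sha256 hash plus lookup), and the token fields are all assumptions; only the 60 req/min/token limit, the 401 / 429 outcomes, `Retry-After`, and the "stamp `last_used_at` on success only" rule come from this PR.

```javascript
const LIMIT = 60; // requests per window, per token
const WINDOW_MS = 60_000; // one minute

// Hypothetical factory. lookupToken(bearer) is assumed to return a token
// record ({ id, revoked, hostArchived, lastUsedAt }) or null.
function createAgentGate(lookupToken) {
  const buckets = new Map(); // token id -> { windowStart, count }

  return function handle(bearer, nowMs) {
    // Step 1: authenticate. Nothing below runs without a resolved token.
    const token = typeof bearer === "string" ? lookupToken(bearer) : null;
    if (!token || token.revoked || token.hostArchived) return { status: 401 };

    // Step 2: throttle, keyed by the token that was just resolved.
    let bucket = buckets.get(token.id);
    if (!bucket || nowMs - bucket.windowStart >= WINDOW_MS) {
      bucket = { windowStart: nowMs, count: 0 };
      buckets.set(token.id, bucket);
    }
    if (bucket.count >= LIMIT) {
      const retryAfter = Math.ceil(
        (bucket.windowStart + WINDOW_MS - nowMs) / 1000
      );
      return { status: 429, retryAfter };
    }
    bucket.count += 1;

    // Step 3: only a request that passed both checks stamps last_used_at.
    token.lastUsedAt = nowMs;
    return { status: 204 };
  };
}
```

Because the bucket key is derived from the authenticated token, the limiter can never see a null key, which is the failure the named-limiter attempt ran into.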

Deferred

  • Container removal sweep (out of scope; future spec).
  • Activity events for `host.online` / `host.recovered` / `container.unhealthy` (deferred to spec 029).
  • Promoting host issues to `alerts` rows (Phase 7).
  • Production-grade Go agent (roadmap §8.7's "later").

Copxer added 3 commits May 1, 2026 17:11
Wires `POST /agent/telemetry` so a Nexus agent on a Docker host can push
host + container stats. Builds on spec 026's hashed agent tokens.

- AuthenticateAgent middleware (alias `agent.auth`): hashes the
  Authorization bearer with sha256, looks up an active token, rejects
  401 on miss/malformed/revoked/archived-host. Stamps last_used_at
  after a successful auth + rate-limit pass and attaches the resolved
  Host + AgentToken to the request attributes.
- Per-token rate limit (60 req/min) lives inside the same middleware
  rather than a separate `throttle:` named limiter, because Laravel's
  default middleware priority runs ThrottleRequests before unlisted
  custom middleware — a named limiter keyed off the request attribute
  would always see null. Returns 429 with Retry-After.
- IngestTelemetryRequest validates the full payload plus a skew
  window on `recorded_at` (at most 1h past, 5min future).
- IngestHostTelemetryAction wraps in a single transaction: updates
  host facts (only when present), flips status `pending` → `online`,
  inserts one host_metric_snapshots row, dispatches container array.
- SyncContainerSnapshotsAction upserts containers on
  (host_id, container_id) and appends one container_metric_snapshots
  row per entry. Container removal is deferred — missing rows are
  left intact.
- Reference Node agent at agent/reference-agent.mjs (single file, no
  deps, Node 20+) plus agent/README.md documenting env vars, exit
  codes, payload contract, and a sample systemd unit.
- Tests: 22 new (5 middleware auth cases + last_used_at gating, 7
  controller cases incl. 3 rate-limit cases, 4 ingest-action unit
  tests, 4 container-sync unit tests). Full suite 456 passing.

CSRF exclusion for `agent/telemetry` registered alongside the GitHub
webhook entry.

- Strip session/cookie/Inertia stack from /agent/telemetry route. Agents
  are non-browser JSON clients on a 30s heartbeat — leaving the web
  group middleware on the route would spawn ~144k orphan session rows
  per day at 50 hosts and emit Set-Cookie on every 204. Verified by
  a feature test that asserts no cookies on the response.
- Skip Online → Online status writes in IngestHostTelemetryAction so
  every healthy heartbeat doesn't issue a redundant UPDATE.
- Add tests: non-string recorded_at (0 / false / []), host-isolation
  for same container_id on two hosts, status-no-op when already Online,
  no session middleware on agent endpoint.
- Flip spec status to done, tick acceptance criteria, record both
  implementation deviations (rate-limit-in-middleware, web-stack-strip)
  in the work log so future readers have the rationale.
- Bump phase-6 + master tracker (1/4 → 2/4 specs done).

Tests grew 22 → 26 new (full suite 460 passing). Pint clean, build green.
@Copxer Copxer merged commit b69cd64 into main May 2, 2026
1 check passed
@Copxer Copxer deleted the spec/027-agent-telemetry-ingestion branch May 2, 2026 00:46