fix(proxy): GOAWAY-aware APNs client + auto-prune stale tokens #14

Merged: DocNR merged 1 commit into main from fix/proxy-apns-resilience on Apr 28, 2026
DocNR (Owner) commented Apr 28, 2026

Summary

Fixes a real production incident from 2026-04-27 evening: the proxy stopped sending APNs pushes for ~6 hours (~18:35 → ~02:13 EDT). Apple sent a GOAWAY frame, but our reconnect logic only reacted to the `error` and `close` events. After GOAWAY the session is not `destroyed`; it just refuses new streams. The cached-session check therefore kept returning the dead session for every `sendPush`, and each request failed with `NGHTTP2_REFUSED_STREAM` with no recovery. Compliance events were still arriving from the relay, but APNs delivery dropped to zero, so force-closed Clave installs got no NSE wakes for the entire window.

We restarted the proxy manually to recover. This PR makes that scenario self-healing.
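
To illustrate the failure mode, here is a minimal sketch of a session cache that drops its session on `goaway` instead of trusting `destroyed` alone. It is not the code in `apnsClient.js`: `createSessionCache`, the injectable `connect` parameter, and the APNs host used as the default are assumptions made for illustration.

```javascript
const http2 = require('node:http2');

// Hypothetical helper: caches one HTTP/2 session and discards it as soon as
// it can no longer open new streams. `connect` is injectable so the logic
// can be exercised without a network connection.
function createSessionCache(connect = () => http2.connect('https://api.push.apple.com')) {
  let session = null;

  function invalidateSession(reason) {
    if (!session) return;
    const dead = session;
    session = null;
    console.warn(`[APNs] session invalidated: ${reason}`);
    // Let in-flight streams finish; the next getSession() opens a new session.
    if (!dead.destroyed) dead.close();
  }

  function getSession() {
    // Checking `destroyed` alone is what failed in the incident: after GOAWAY
    // the session is not destroyed, it only refuses new streams.
    if (session && !session.destroyed && !session.closed) return session;

    const fresh = connect();
    session = fresh;
    for (const event of ['goaway', 'frameError', 'error', 'close']) {
      fresh.on(event, () => {
        // Ignore late events from a session that was already replaced.
        if (session === fresh) invalidateSession(event);
      });
    }
    return fresh;
  }

  return { getSession, invalidateSession };
}
```

The identity check inside the listener matters: the old session still emits `close` after it has been replaced, and that late event must not throw away the fresh session.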

Bonus: the user's account had accumulated 199 `BadDeviceToken` responses in 24h because we previously pruned tokens only on HTTP 410. Apple treats both 400 `BadDeviceToken` and 410 `Unregistered` as terminal token states, so both now prune.
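
The pruning rule can be sketched as a pure function of the APNs status code and response body. This is a guess at the shape of the `shouldPruneToken(status, body)` and `parseReason` helpers named under Changes, not their actual source; the boolean return value and the `PRUNE_REASONS_400` set are assumptions.

```javascript
// Assumed set of terminal 400 reasons, taken from the PR description.
const PRUNE_REASONS_400 = new Set(['BadDeviceToken', 'DeviceTokenNotForTopic']);

// Extracts the `reason` field from an APNs error body (JSON string or object).
// Returns null when the body is missing, malformed, or has no string reason.
function parseReason(body) {
  try {
    const parsed = typeof body === 'string' ? JSON.parse(body) : body;
    return parsed && typeof parsed.reason === 'string' ? parsed.reason : null;
  } catch {
    return null;
  }
}

// True when the response means the token will never be deliverable again.
function shouldPruneToken(status, body) {
  if (status === 410) return true; // token is no longer active for the topic
  if (status === 400) return PRUNE_REASONS_400.has(parseReason(body));
  return false;
}

// Example: a 400 response carrying a terminal reason is pruned and logged.
const exampleBody = '{"reason":"BadDeviceToken"}';
if (shouldPruneToken(400, exampleBody)) {
  console.log(`[APNs] Removed stale token: ... (${parseReason(exampleBody)})`);
}
```

Other 400 reasons (for example a malformed payload) describe a bad request rather than a bad token, so in this sketch they do not prune.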

Changes

New `relay-proxy/apnsClient.js`

Extracted into a testable module.

  • Listens for `goaway` and `frameError` events explicitly; calls `invalidateSession(reason)` to discard the cached session.
  • `sendPush` catches synchronous throws from `client.request(...)` and auto-retries once on a fresh session for fatal-session error codes (`ERR_HTTP2_GOAWAY_SESSION`, `NGHTTP2_REFUSED_STREAM`, `ECONNRESET`, etc.).
  • Periodic HTTP/2 PING every 5 min keeps the kernel + Apple's edge socket warm; ping failure invalidates the session before the next real push.
  • Operational counters: `sessionConnects`, `sessionInvalidations`, `sendOk`, `sendFail`, `sendRetried`, `pruneOnBadDeviceToken`, `pruneOnUnregistered`, `lastSendAt`, `lastFailureAt`, `lastFailureReason`, `sessionAlive`, `sessionAgeSeconds`.
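
The retry-once behaviour described in the list above could look roughly like the sketch below. The code set contains only the three codes this PR names (the rest are elided in the description), the message fallback is an assumption, and `sendWithRetry` with its three callback parameters is a hypothetical shape rather than the module's real `sendPush`.

```javascript
// Only the codes named in this PR; the real lists are longer.
const FATAL_SESSION_CODES = new Set([
  'ERR_HTTP2_GOAWAY_SESSION',
  'NGHTTP2_REFUSED_STREAM',
  'ECONNRESET',
]);

// True when the error means the whole session is unusable, not just one stream.
function isSessionFatalError(err) {
  if (!err) return false;
  if (FATAL_SESSION_CODES.has(err.code)) return true;
  // Assumed fallback: match on the message when no code is attached.
  return /GOAWAY|REFUSED_STREAM/.test(String(err.message || ''));
}

// `attempt(session)` performs one push and may throw synchronously or reject.
// `getSession` and `invalidateSession` are supplied by the session cache.
async function sendWithRetry(attempt, getSession, invalidateSession) {
  try {
    return await attempt(getSession());
  } catch (err) {
    if (!isSessionFatalError(err)) throw err;
    invalidateSession(err.code || err.message);
    // Exactly one retry on a fresh session; a second failure propagates.
    return attempt(getSession());
  }
}
```

Because the first attempt runs inside the `try` of an `async` function, a synchronous throw from `client.request(...)` and an asynchronous rejection take the same recovery path.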

`proxy.js`

  • Swap inline impl for the new module (-94 lines, +42).
  • Use `shouldPruneToken(status, body)` so `400 BadDeviceToken` / `DeviceTokenNotForTopic` now prune, not just 410.
  • Log the pruning reason: `[APNs] Removed stale token: ... (BadDeviceToken)`.
  • `/health` now includes an `apns:` block — lets us spot the "stuck dead" state from outside the process via `curl`.
  • Graceful shutdown on SIGTERM/SIGINT closes the APNs session cleanly before exit.
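
As a rough sketch of the last two items, the `apns` health block can be built from a counters object plus the current session state, and the shutdown hook can wait for the session's `close` event before exiting. Field names follow the counter list in this PR; `createStats`, `healthSnapshot`, `installShutdown`, and the `connectedAt` timestamp are hypothetical names for illustration.

```javascript
// Hypothetical counters object; field names follow the list in this PR.
function createStats() {
  return {
    sessionConnects: 0, sessionInvalidations: 0,
    sendOk: 0, sendFail: 0, sendRetried: 0,
    pruneOnBadDeviceToken: 0, pruneOnUnregistered: 0,
    lastSendAt: null, lastFailureAt: null, lastFailureReason: null,
  };
}

// Builds the `apns` block for /health. `session` may be null;
// `connectedAt` is an assumed timestamp (ms) recorded at connect time.
function healthSnapshot(stats, session, connectedAt, now = Date.now()) {
  const alive = Boolean(session && !session.destroyed && !session.closed);
  return {
    ...stats,
    sessionAlive: alive,
    sessionAgeSeconds: alive ? Math.floor((now - connectedAt) / 1000) : null,
  };
}

// Closes the session on SIGTERM/SIGINT so the peer sees a clean disconnect.
function installShutdown(getCurrentSession, exit = process.exit) {
  for (const signal of ['SIGTERM', 'SIGINT']) {
    process.once(signal, () => {
      const session = getCurrentSession();
      if (!session || session.destroyed) return exit(0);
      // Listen for 'close' directly so this also works if close() was
      // already called (for example after a GOAWAY).
      session.once('close', () => exit(0));
      session.close();
    });
  }
}
```

With this shape, a "stuck dead" session shows up from outside the process as `sessionAlive: false` alongside a `sendFail` counter that keeps climbing.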

`test/apnsClient.test.js`

14 unit tests covering the pure helpers. The exact error string from the prod incident — `"New streams cannot be created after receiving a GOAWAY"` — is locked in as a regression guard.

Test plan

  • `node --check proxy.js` → OK
  • `node --test test/apnsClient.test.js` → 14/14 pass
  • `node --test test/` → 64/65 pass (the 1 failure is the pre-existing `nip98.test.js` ESM-on-Node-18 issue documented in PROJECT-STATE.md, unrelated)
  • Deploy to Dell: `sudo cp relay-proxy/proxy.js relay-proxy/apnsClient.js /opt/clave-proxy/ && sudo systemctl restart clave-proxy`
  • Verify post-deploy:
    • `curl -sS https://proxy.clave.casa/health | jq .apns` shows `sessionAlive: true` and counters incrementing
    • After a few signs: `sendOk` increments, `lastSendAt` updates
    • After ~24h, the user's stale token is pruned (`pruneOnBadDeviceToken` ≥ 1; `tokens.json` no longer contains it)
  • Resilience smoke test (post-deploy, low-pressure): monitor `/health.apns.sessionInvalidations` over a week — expect a small number (Apple does send periodic GOAWAYs); each invalidation should be followed by a successful `sendOk` increment within minutes, proving auto-recovery works.

🤖 Generated with Claude Code

…uning

Fixes a 6-hour APNs outage on 2026-04-27 evening (~18:35 → ~02:13 EDT).
Apple sent a GOAWAY frame; the previous reactive reconnect logic only
nulled the cached session on `error` and `close` events. After GOAWAY
the session is NOT `destroyed` — it just refuses new streams. The check
`if (apnsClient && !apnsClient.destroyed) return apnsClient;` happily
returned the dead session for every subsequent `sendPush`, producing
NGHTTP2_REFUSED_STREAM per request with no recovery. Compliance events
kept being received from the relay, but APNs delivery hit zero.
Force-closed Clave installs got no NSE wakes for the entire window.

The user's account also accumulated 199 BadDeviceToken responses in 24h
because the previous code only pruned tokens on HTTP 410, not on 400 +
`reason: BadDeviceToken`. Apple lumps both as terminal token states.

Changes:

* New `relay-proxy/apnsClient.js` extracts the APNs client into a
  testable module:
  - Listens for `goaway` and `frameError` events explicitly; calls
    `invalidateSession(reason)` to discard the cached session.
  - `sendPush` catches synchronous throws from `client.request(...)`
    (session destroyed mid-call previously bubbled up uncaught) and
    auto-retries once on a fresh session when the error code is one of
    `ERR_HTTP2_GOAWAY_SESSION`, `NGHTTP2_REFUSED_STREAM`, `ECONNRESET`,
    etc. (full list in `FATAL_SESSION_CODES` / `FATAL_NGHTTP2_CODES`).
  - Periodic HTTP/2 PING every 5 min keeps the kernel + Apple's edge
    socket warm; ping failure invalidates the session before the next
    real push, surfacing dead connections proactively.
  - Operational counters: `sessionConnects`, `sessionInvalidations`,
    `sendOk`, `sendFail`, `sendRetried`, `pruneOnBadDeviceToken`,
    `pruneOnUnregistered`, `lastSendAt`, `lastFailureAt`,
    `lastFailureReason`, `sessionAlive`, `sessionAgeSeconds`.

* `proxy.js` swaps the inline implementation for the new module:
  - Uses `shouldPruneToken(status, body)` so 400 + BadDeviceToken /
    DeviceTokenNotForTopic now prune, not just 410.
  - Logs the pruning reason: `[APNs] Removed stale token: ... (BadDeviceToken)`.
  - `/health` now includes an `apns:` block exposing all the above
    counters — lets us spot the "stuck dead" state from outside the
    process via curl.
  - Graceful shutdown on SIGTERM/SIGINT closes the APNs session before
    exit so Apple sees a clean disconnect rather than a half-open
    socket on every `systemctl restart`.

* `test/apnsClient.test.js`: 14 unit tests covering the pure helpers
  (`isSessionFatalError`, `shouldPruneToken`, `parseReason`). The
  scenario from the prod incident — `new Error("New streams cannot be
  created after receiving a GOAWAY")` — is locked in as a regression
  guard.

Verification:
- node --check proxy.js → OK
- node --test test/apnsClient.test.js → 14/14 pass
- node --test test/ → 64/65 pass (the one failure is the pre-existing
  nip98 ESM-on-Node-18 issue documented in PROJECT-STATE.md, unrelated)
- Next step is a live deploy test (sudo cp proxy.js apnsClient.js
  to /opt/clave-proxy/ + systemctl restart, then watch /health.apns
  counters increment over a few signs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DocNR merged commit d7ab10a into main on Apr 28, 2026
DocNR deleted the fix/proxy-apns-resilience branch on April 28, 2026 13:43