
feat(ocap-kernel): Automatic reconnection with exponential backoff for remote comms#678

Merged
sirtimid merged 21 commits into main from sirtimid/remote-comms-Automatic-Reconnection
Nov 3, 2025

Conversation

@sirtimid (Contributor) commented Oct 27, 2025:

Closes #655, #659

Implements resilient automatic reconnection for remote kernel communications with exponential backoff, message queuing, and intelligent error handling. Remote connections now recover seamlessly from network failures without manual intervention or message loss.

Motivation

Distributed kernel systems experience frequent network disruptions due to:

  • Machine sleep/wake cycles
  • Network switches (WiFi ↔ Ethernet, VPN reconnections)
  • Router restarts and brief internet outages
  • Transient connection failures

Previously, connections would simply drop and fail. This PR makes the system resilient to these real-world scenarios.

Key Features

Infinite Reconnection with Exponential Backoff

  • Retries indefinitely with capped exponential backoff (500ms → 1s → 2s → 4s → 8s → 10s max)
  • Full jitter prevents thundering herd problems
  • Smart error classification: retries transient errors, bails on permanent failures
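
The backoff schedule above can be sketched as follows. `calculateReconnectionBackoff` is the utility name this PR introduces, but this body is an illustrative reconstruction of the described behavior (500ms base, 10s cap, full jitter), not the shipped implementation:

```typescript
const BASE_MS = 500;
const MAX_MS = 10_000;

function calculateReconnectionBackoff(attempt: number): number {
  // Exponential growth: 500ms, 1s, 2s, 4s, 8s, then capped at 10s.
  const ceiling = Math.min(MAX_MS, BASE_MS * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling] so that many peers
  // reconnecting at once do not retry in lockstep (thundering herd).
  return Math.random() * ceiling;
}
```

With full jitter the delay for attempt `n` is anywhere between 0 and the capped ceiling, which trades a slightly longer average wait for much better dispersion of retries.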

Message Queuing

  • Queues up to 200 outbound messages during reconnection
  • Automatically flushes queue when connection is restored
  • No messages lost during brief outages

Sleep/Wake Detection

  • Detects machine wake via clock jump detection (cross-platform)
  • Resets backoff counters for fast reconnection after wake
  • Handles laptop sleep scenarios gracefully

Clean Lifecycle Management

  • New stop() function for graceful shutdown
  • AbortSignal propagation cancels all in-flight operations
  • Prevents zombie dials/delays during kernel restart
  • Race condition prevention via dialIdempotent()

Technical Implementation

Core Changes:

  • network.ts: Reconnection orchestration, message queuing, wake detection
  • RemoteManager.ts: Added stopRemoteComms() lifecycle method
  • remote-comms.ts: Integrated stop functionality
  • Platform services (Client/Server/Node.js): Wired through RPC layer

New Utilities (in kernel-utils):

  • wake-detector.ts: Reusable sleep/wake detection
  • calculateReconnectionBackoff(): Shared backoff calculation
  • Enhanced abortableDelay(): AbortSignal-aware delays
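
A sketch of what an AbortSignal-aware delay can look like. The name `abortableDelay` comes from the PR; this body is illustrative, reconstructed from the pattern visible in the review excerpts:

```typescript
function abortableDelay(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    // Reject immediately if the signal was already aborted.
    if (signal?.aborted) {
      reject(new Error('Aborted'));
      return;
    }
    const onAbort = (): void => {
      clearTimeout(timer); // prevent zombie delays during shutdown
      reject(new Error('Aborted'));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener('abort', onAbort);
      resolve();
    }, ms);
    signal?.addEventListener('abort', onAbort, { once: true });
  });
}
```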

New Error Handling (in kernel-errors):

  • isRetryableNetworkError(): Classifies transient vs. permanent errors
  • Enhanced AbortError usage throughout

Architecture

Initialization Flow

Browser Environment:

kernel-worker.ts
  ↓ kernel.initRemoteComms(relays)
Kernel.ts
  ↓ remoteManager.initRemoteComms(relays)
RemoteManager.ts
  ↓ initRemoteComms(kernelStore, platformServices, handler, relays, logger)
remote-comms.ts
  ↓ platformServices.initializeRemoteComms(keySeed, knownRelays, handler)
PlatformServicesClient
  ↓ RPC call → 
PlatformServicesServer
  ↓ initNetwork(keySeed, knownRelays, handler)
network.ts
  ✓ Returns { sendRemoteMessage, stop }

Node.js Environment (Simpler):

Kernel → RemoteManager → remote-comms → NodejsPlatformServices → network.ts
(No RPC layer needed)

Cleanup Flow

Kernel.stop()
  ↓ remoteManager.stopRemoteComms()
  ↓ remoteComms.stopRemoteComms()
  ↓ platformServices.stopRemoteComms()
  ↓ [RPC in browser] → PlatformServicesServer.#stopRemoteCommsFunc()
  ↓ network.stop()
    • Clears wake detector interval ✓
    • Aborts all in-flight operations ✓  
    • Stops libp2p ✓
    • Clears all maps ✓
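
The teardown order in the diagram above can be sketched roughly as follows. The `NetworkHandle` shape and field names here are stand-ins, not the actual `network.ts` internals:

```typescript
type NetworkHandle = {
  abortController: AbortController;
  wakeDetectorInterval: ReturnType<typeof setInterval>;
  channels: Map<string, unknown>;
  stopLibp2p: () => Promise<void>;
};

async function stop(net: NetworkHandle): Promise<void> {
  clearInterval(net.wakeDetectorInterval); // clear wake detector interval
  net.abortController.abort(); // abort all in-flight dials and delays
  await net.stopLibp2p(); // stop libp2p transports
  net.channels.clear(); // clear per-peer maps
}
```

Aborting before stopping libp2p matters: it cancels pending dials and backoff delays so nothing tries to reconnect while the transports are being shut down.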

Note

Adds automatic reconnection with exponential backoff and message queuing for remote communications, plus a stopRemoteComms lifecycle, new error utilities, and full platform wiring with tests.

  • Remote Comms (ocap-kernel):
    • Implement ConnectionFactory, MessageQueue, ReconnectionManager, and overhaul remotes/network.ts for autodial, exponential backoff (with jitter), per-peer queues (max 200), and wake-from-sleep handling.
    • Add stop() to network init; expose via StopRemoteComms and RemoteComms.stopRemoteComms.
    • New RPC method stopRemoteComms with spec/handler; integrated in platform-services.
    • Persist/restore remote hints in store (store/methods/remote.ts).
  • Platform Wiring:
    • Browser server/client: wire initializeRemoteComms, sendRemoteMessage, and new stopRemoteComms.
    • Node.js PlatformServices: mirror start/stop/send; manage handlers and cleanup.
    • Kernel: call remoteManager.stopRemoteComms() during Kernel.stop().
  • Kernel Utils:
    • Add retry, retryWithBackoff, calculateReconnectionBackoff, abortableDelay, and installWakeDetector; export from index.ts.
  • Errors:
    • Add AbortError and isRetryableNetworkError (libp2p/Node codes); export updates.
  • API/Types:
    • Move remote comms types to remotes/types.ts; re-export in package index.
  • Tests/Coverage:
    • Extensive unit/integration tests for reconnection, queuing, wake, RPC, and platform services; update coverage thresholds.
  • Misc:
    • Add @libp2p/interface dep; small tooling/settings tweaks.

Written by Cursor Bugbot for commit 8832e09.

@sirtimid sirtimid changed the title Sirtimid/remote comms automatic reconnection feat(ocap-kernel): Automatic reconnection with exponential backoff for remote communications Oct 27, 2025
@sirtimid sirtimid changed the title feat(ocap-kernel): Automatic reconnection with exponential backoff for remote communications feat(ocap-kernel): Automatic reconnection with exponential backoff for remote comms Oct 27, 2025
@sirtimid sirtimid marked this pull request as ready for review October 27, 2025 16:36
@sirtimid sirtimid requested a review from a team as a code owner October 27, 2025 16:36
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-Automatic-Reconnection branch from 73f8ae0 to 7408f06 Compare October 28, 2025 19:28
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-Automatic-Reconnection branch from 879c6a9 to 86e97d0 Compare October 29, 2025 16:55
@FUDCo (Contributor) left a comment:

Not done reading the code yet, but publishing the first batch of comments.

Comment on lines +48 to +56
const connectionFactory = new ConnectionFactory(
  keySeed,
  knownRelays,
  logger,
  signal,
);

const libp2p = await createLibp2p({
  privateKey,
  addresses: {
    listen: [
      // TODO: Listen on tcp addresses for Node.js
      // '/ip4/0.0.0.0/tcp/0/ws',
      // '/ip4/0.0.0.0/tcp/0',
      // Browser: listen on WebRTC and circuit relay
      '/webrtc',
      '/p2p-circuit',
    ],
    appendAnnounce: ['/webrtc'],
  },
  transports: [
    webSockets(),
    webTransport(),
    webRTC({
      rtcConfiguration: {
        iceServers: [
          {
            urls: [
              'stun:stun.l.google.com:19302',
              'stun:global.stun.twilio.com:3478',
            ],
          },
        ],
      },
    }),
    circuitRelayTransport(),
  ],
  connectionEncrypters: [noise()],
  streamMuxers: [yamux()],
  connectionGater: {
    // Allow private addresses for local testing
    denyDialMultiaddr: async () => false,
  },
  peerDiscovery: [
    bootstrap({
      list: knownRelays,
    }),
  ],
  services: {
    identify: identify(),
    ping: ping(),
  },
});

// Detailed logging for libp2p events. Uncomment as needed. Arguably this
// should be controlled by an environment variable or some similar kind of
// runtime flag, but probably not worth the effort since when you're debugging
// you're likely going to be tweaking with the code a lot anyway.
/*
const eventTypes = [
  'certificate:provision',
  'certificate:renew',
  'connection:close',
  'connection:open',
  'connection:prune',
  'peer:connect',
  'peer:disconnect',
  'peer:discovery',
  'peer:identify',
  'peer:reconnect-failure',
  'peer:update',
  'self:peer:update',
  'start',
  'stop',
  'transport:close',
  'transport:listening',
];
for (const et of eventTypes) {
  libp2p.addEventListener(et as keyof Libp2pEvents, (event) => {
    if (et === 'connection:open' || et === 'connection:close') {
      const legible = (raw: any): string => JSON.stringify({
        direction: raw.direction,
        encryption: raw.encryption,
        id: raw.id,
        remoteAddr: raw.remoteAddr.toString(),
        remotePeer: raw.remotePeer.toString(),
      });
      logger.log(`@@@@ libp2p ${et} ${legible(event.detail)}`, event.detail);
    } else if (et === 'peer:identify') {
      const legible = (raw: any): string => JSON.stringify({
        peerId: raw.peerId ? raw.peerId.toString() : 'undefined',
        protocolVersion: raw.protocolVersion,
        agentVersion: raw.agentVersion,
        observedAddr: raw.observedAddr ? raw.observedAddr.toString() : 'undefined',
        listenAddrs: raw.listenAddrs.map((addr: object) => addr ? addr.toString() : 'undefined'),
        protocols: raw.protocols,
      });
      logger.log(`@@@@ libp2p ${et} ${legible(event.detail)}`, event.detail);
    } else if (et === 'transport:listening') {
      const legible = (raw: any): string => JSON.stringify(raw.getAddrs());
      logger.log(`@@@@ libp2p ${et} ${legible(event.detail)}`, event.detail);
    } else {
      logger.log(`@@@@ libp2p ${et} ${JSON.stringify(event.detail)}`, event.detail);
    }
  });
}
*/
// Initialize the connection factory
await connectionFactory.initialize();
@FUDCo (Contributor):

I'm thinking perhaps ConnectionFactory should follow the pattern (as found in, e.g., Kernel, VatHandle, and RemoteHandle) of making the constructor and initialization methods private and providing a public static make method so that the uninitialized object is never exposed publicly.

@sirtimid (Author):

Ok, done: bd5dda1

Comment on lines +271 to +284
  async dialIdempotent(
    peerId: string,
    hints: string[],
    withRetry: boolean,
  ): Promise<Channel> {
    let promise = this.#inflightDials.get(peerId);
    if (!promise) {
      promise = (
        withRetry
          ? this.openChannelWithRetry(peerId, hints)
          : this.openChannelOnce(peerId, hints)
      ).finally(() => this.#inflightDials.delete(peerId));
      this.#inflightDials.set(peerId, promise);
    }
@FUDCo (Contributor):

It feels to me like there's an impedance mismatch between the withRetry parameter and the collection of in-flight dials. Basically, the first caller determines the retry behavior and the flag is thenceforth ignored on successive attempts. Arguably this is not wrong, and I'm not sure I'd do it differently, but it smells off.

@sirtimid (Author):

The first caller sets the retry behavior, and later concurrent calls for the same peer reuse the same in-flight promise, so their withRetry value is ignored, but this is intentional. We deduplicate in-flight dials (only one active dial per peer at a time) which requires picking a single retry strategy per attempt.

I think for this PR keeping the current approach is fine. Concurrent callers for the same peer get the same result (success or failure), and we avoid duplicate dials. But we could track dials separately per (peerId, withRetry) combination. wdyt?

Comment on lines +133 to +139
// Detect graceful disconnect
const rtcProblem = problem as {
  errorDetail?: string;
  sctpCauseCode?: number;
};
if (
  rtcProblem.errorDetail === 'sctp-failure' &&
@FUDCo (Contributor):

Does "graceful disconnect" mean "the other end deliberately closed the connection"? Because in that case I don't think we should be reconnecting -- though I don't see how this code actually reacts to that case anyway. But there needs to be some way to close a connection on purpose.

@sirtimid (Author):

Yes, "graceful disconnect" means the other end deliberately closed the connection. SCTP_USER_INITIATED_ABORT (cause code 12) indicates an intentional close by the remote peer, not a network failure.

You're right: the code detects this case (lines 135-139) but doesn't prevent reconnection. It only logs "remote disconnected" instead of an error, then calls handleConnectionLoss() on line 148, which triggers reconnection. We should distinguish intentional disconnects from transient failures and avoid auto-reconnecting for the former.

This is out of scope for this PR, which focuses on automatic reconnection with exponential backoff for network failures (timeouts, connection resets, etc.). Handling intentional disconnects would require:

  1. A mechanism to signal an intentional close
  2. Distinguishing intentional closes from failures in the error handling
  3. Preventing reconnection when the close was intentional

I suggest we create a follow-up issue to track intentional disconnect handling, including proper connection lifecycle management and explicit close operations. For now, this PR maintains current behavior (treating all disconnects as recoverable) while adding retry/backoff for network failures.

*/
function handleConnectionLoss(peerId: string, hints: string[] = []): void {
  logger.log(`${peerId}:: connection lost, initiating reconnection`);
  channels.delete(peerId);
@FUDCo (Contributor):

My original intent was that a "channel" was a logical abstraction that would survive loss of the underlying connection, whereas a "connection" was the abstraction of a concrete communications link that could be disrupted by network problems. This looks to me like that abstraction distinction got discarded by this refactor, though I obviously could be missing something in the big PR. It seems like that's an important distinction to keep track of, but maybe you've thought it through more deeply?

@sirtimid (Author), Nov 3, 2025:

Right: Channel is currently tied to the physical connection. When the connection dies, we delete the channel (line 166), so channels don't survive connection loss.

The abstraction that persists is at a higher level:

  • messageQueues buffer messages across disconnections
  • reconnectionManager tracks reconnection state per peer
  • New connections reuse the same peer identity and deliver queued messages

So messages and peer state persist, but Channel objects don't.

To make Channel a true logical abstraction that survives connection loss, we'd need to:

  • Keep channel objects alive across connections
  • Have channels reference the current connection (which changes)
  • Separate channel lifecycle from connection lifecycle

That's a larger architectural change beyond this PR's scope. For now, queues and reconnection provide the logical abstraction, while Channel remains tied to the physical connection.

Should we track this as a follow-up?

@@ -0,0 +1,65 @@
/**
* Options for configuring the wake detector.
@FUDCo (Contributor):

The phrase "wake detector" here obviously refers to recognizing awakening, i.e., coming back from being asleep, but somehow it keeps registering to me as the thing that reports that your boat is going too fast inside the boundaries of the marina.

@sirtimid (Author):

Should I rename it? :)

Comment on lines +50 to +59
let last = Date.now();

const intervalId = setInterval(() => {
  const now = Date.now();
  if (now - last > intervalMs + jumpThreshold) {
    // Clock jumped forward significantly - probable wake from sleep
    onWake();
  }
  last = now;
}, intervalMs);
@FUDCo (Contributor):

Since the interval is being tracked by state that is kept in memory, this code clearly is dealing with time discontinuities within a single executing process. But it seems to me that we also need to detect discontinuities across successive incarnations of a kernel. This is obviously not that (you wouldn't detect those by looking at the clock anyway), but is that case also handled somewhere (e.g., as part of kernel startup)?

@sirtimid (Author):

Yes, the current detector only works within a single running process. It tracks time in memory, so it won't detect sleep/wake across kernel restarts.

Cross-incarnation detection would require different logic:

  • Store the last known timestamp in persistent storage (e.g., kernel store) when shutting down
  • On startup, compare the stored timestamp with the current time
  • If the gap exceeds a threshold, treat it as a wake event

This isn't currently handled at kernel startup. We could add it if needed. It would need:

  • Writing a timestamp on shutdown (or periodically)
  • Checking it on startup in initNetwork (or during kernel initialization)
  • Resetting backoffs if a time discontinuity is detected

We can track this as a follow-up:

Cross-incarnation wake detection: Detect time discontinuities across kernel restarts by storing the last known timestamp in persistent storage on shutdown and comparing it with the current time on startup. If the gap exceeds a threshold (indicating the system was asleep between shutdown and startup), reset reconnection backoffs similar to the runtime wake detector. This ensures reconnection backoffs are reset even when the kernel wasn't running during the sleep period.
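
As a rough illustration of that follow-up idea, the startup check could look like this. The store accessor and the gap threshold are hypothetical, not part of this PR:

```typescript
const GAP_THRESHOLD_MS = 60_000; // assumed threshold for "was asleep"

function detectCrossIncarnationWake(
  readStoredTimestamp: () => number | undefined, // hypothetical store read
  now: number = Date.now(),
): boolean {
  const last = readStoredTimestamp();
  // No prior timestamp means a fresh install, not a wake from sleep.
  if (last === undefined) {
    return false;
  }
  // A gap larger than the threshold suggests the system slept between
  // the last shutdown (or heartbeat) and this startup.
  return now - last > GAP_THRESHOLD_MS;
}
```

On a positive result, the kernel would reset reconnection backoffs the same way the runtime wake detector does.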

@FUDCo (Contributor) left a comment:

I think this is good as far as it goes, but it's stimulating a lot of concerns on my part that we haven't thoroughly thought through the lifecycle model of the relationship between communicating objects living in vats on separate machines, given that this relationship can be disrupted by not only the network but the uptime of the browsers that are hosting the respective kernels and the machines that are hosting the respective browsers.

In particular, we generally consider network disruptions to be transient errors (though the transience may span all kinds of things -- not just a TCP connection drop but possibly rehosting an endpoint at an entirely different address, such as when I recently switched my household internet from Comcast Business to AT&T Fiber).

On the other hand, browser or host uptime disruptions can include both transient events (such as power failures) and intentional acts by a user at one of the endpoints (e.g., I decide I don't want to run this thing any more). I think the state of our thinking on this stuff is currently rather muddled. I also strongly expect that the path out of this muddle is going to involve a lot of practical trial and error experience aside from whatever technical brilliance we may bring to the party. In other words, I think this may take time and experimentation rather than just raw intellect.

Stopping now because Consensys IT is insisting on rebooting my machine NOW.

* @param error - The error to check if it is retryable.
* @returns True if the error is retryable, false otherwise.
*/
export function isRetryableNetworkError(error: unknown): boolean {
@FUDCo (Contributor):

Which conditions would constitute a non-retryable network error? I think we'll get an ECONNRESET if a kernel is shutdown and probably ECONNREFUSED or EHOSTUNREACH if it's just not running when you try to talk to it, so do we have a way to distinguish "it's not running now" from "it will never be running again" from "you have the wrong address"?

@sirtimid (Author):

Yeah, the function doesn't distinguish between "not running now" (temporary), "will never be running again" (permanent), and "wrong address" (permanent). All listed network errors (ECONNRESET, ECONNREFUSED, EHOSTUNREACH, etc.) are treated as retryable, so we rely on maxAttempts to eventually stop retrying. We can't distinguish these scenarios from error codes alone. I'll create a follow-up task:

Error pattern analysis and permanent failure detection: Implement heuristics to distinguish temporary network failures from permanent failures. Track error patterns over time (consecutive identical errors, error frequency) and classify persistent failures as permanently non-retryable:

  • Track the pattern of errors across reconnection attempts
  • If the same error code (e.g., ECONNREFUSED, EHOSTUNREACH) persists across many attempts without any success, classify it as permanently non-retryable
  • Detect "wrong address" scenarios through persistent connection refusal patterns
  • Integrate with reconnection logic to stop retrying when patterns indicate permanent failure
  • This would enhance isRetryableNetworkError to become stateful (tracking patterns) or add a separate mechanism that feeds into retry decisions
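A stateful classifier along the lines proposed above could be sketched like this. The class name, API, and threshold are assumptions for illustration, not the shipped code:

```typescript
class ErrorPatternTracker {
  #lastCode: string | undefined;
  #consecutive = 0;

  constructor(readonly permanentThreshold = 20) {}

  recordFailure(code: string): void {
    // Extend the streak only if the same code repeats; otherwise restart it.
    this.#consecutive = code === this.#lastCode ? this.#consecutive + 1 : 1;
    this.#lastCode = code;
  }

  recordSuccess(): void {
    // Any successful connection clears the failure pattern.
    this.#lastCode = undefined;
    this.#consecutive = 0;
  }

  // e.g. 20 straight ECONNREFUSED likely means a wrong or dead address.
  isLikelyPermanent(): boolean {
    return this.#consecutive >= this.permanentThreshold;
  }
}
```

The reconnection loop would consult `isLikelyPermanent()` alongside `isRetryableNetworkError` before scheduling the next backoff.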

signal.addEventListener('abort', onAbort, { once: true });
}
});
}
@FUDCo (Contributor):

I think Bugbot is right about this one.

 * Gracefully stop the kernel without deleting vats.
 */
async stop(): Promise<void> {
  await this.#remoteManager.stopRemoteComms();
@FUDCo (Contributor):

I like having a provision for an orderly shutdown, but I fear that this will rarely happen in the wild. Instead, I think it far more likely that somebody just quits their browser, taking any running kernel and its vats with it. Outside parties that had a communications relationship with whatever had been running in there might have to wait a very long time before being able to reconnect. I'm not sure what this means in terms of maximum retry backoff limits, but I suspect it means something.

@sirtimid (Author):

Will be handled in a follow-up.

 * Handles queueing of messages with their hints during reconnection.
 */
export class MessageQueue {
  readonly #queue: QueuedMessage[] = [];
@FUDCo (Contributor):

I'm wondering if queued messages need to be backed up in persistent storage.

@sirtimid (Author):

Yeah, we can also track this as a follow-up task.

@sirtimid sirtimid force-pushed the sirtimid/remote-comms-Automatic-Reconnection branch from 86e97d0 to bd5dda1 Compare November 3, 2025 17:58

@sirtimid sirtimid requested a review from FUDCo November 3, 2025 20:29
@sirtimid sirtimid linked an issue Nov 3, 2025 that may be closed by this pull request
@FUDCo (Contributor) left a comment:

There's lots still to do here, but I concur that it's all stuff for follow-on issues.
Let's plug it in and give it the smoke test.

@sirtimid sirtimid merged commit 64b1042 into main Nov 3, 2025
26 checks passed
@sirtimid sirtimid deleted the sirtimid/remote-comms-Automatic-Reconnection branch November 3, 2025 22:16
@sirtimid sirtimid linked an issue Nov 4, 2025 that may be closed by this pull request