perf(pam launch): shrink tunnel-open sleeps and parallelize WebSocket listener #1982

Merged
idimov-keeper merged 1 commit into release from pam-launch-faster-tunnel-open
Apr 22, 2026
Conversation

@idimov-keeper
Contributor

Builds on the PR1 TunnelDAG caching. Measured against the post-PR1 baseline (~12.4s grand total), this brings `pam launch` down to ~10.3s (another ~2.1s saved; cumulative ~6.7s vs the pre-PR1 baseline).

Tune the hardcoded sleeps in the WebRTC tunnel-open path, then reinvest those savings in an adaptive-fallback retry, so the fast path stays fast while a launch whose first attempt fails still gets the legacy safety window on the retry.

Sleep / polling changes (all env-tunable via the new helpers in `connect_timing.py`; a sketch of the helper shape follows this list):

  - `WEBSOCKET_BACKEND_DELAY` default 2.0s → 0.30s (router/gateway conversation-registration window). Saves ~1.7s on the happy path.
  - The hardcoded `time.sleep(1)` before the offer POST is replaced with `pre_offer_delay_sec()`, default 0.0s (the preceding backend delay already covers router registration). Saves ~1.0s. Set `PAM_PRE_OFFER_LEGACY=1` to restore the 1.0s wait.
  - `PAM_OPEN_CONNECTION_DELAY` default 0.2s → 0.05s. The existing `open_handler_connection` retry loop (exponential backoff) already handles slow DataChannel readiness, so the fixed sleep was mostly redundant. Saves ~150ms.
  - WebRTC connection-state poll tick 100ms → 25ms via the new `PAM_WEBRTC_POLL_MS` env var. Cheap FFI call; tightens P99 handoff latency.
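
For illustration, a minimal sketch of what such env-tunable helpers can look like. `pre_offer_delay_sec()`, `WEBSOCKET_BACKEND_DELAY`, `PAM_PRE_OFFER_LEGACY`, and `PAM_WEBRTC_POLL_MS` are named by this PR; the other helper names and the exact wiring inside `connect_timing.py` are assumptions.

```python
# Sketch only -- the real connect_timing.py may wire these differently.
import os


def _env_float(name: str, default: float) -> float:
    """Return a float override from the environment, or the default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default


def websocket_backend_delay_sec() -> float:      # helper name is a guess
    # Router/gateway conversation-registration window: was 2.0s, now 0.30s.
    return _env_float("WEBSOCKET_BACKEND_DELAY", 0.30)


def pre_offer_delay_sec() -> float:
    # Replaces the hardcoded time.sleep(1) before the offer POST.
    # PAM_PRE_OFFER_LEGACY=1 restores the old 1.0s wait.
    return 1.0 if os.environ.get("PAM_PRE_OFFER_LEGACY") == "1" else 0.0


def webrtc_poll_interval_sec() -> float:         # helper name is a guess
    # Connection-state poll tick, expressed in milliseconds in the env var.
    return _env_float("PAM_WEBRTC_POLL_MS", 25.0) / 1000.0
```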

Parallelize the WebSocket listener with tube creation (see the sketch after this list):

  - `start_websocket_listener` is now called *before* `create_tube`, right after `signal_handler` is wired to `tunnel_session`. The Rust tube creation (~500ms) runs in parallel with the WebSocket TLS handshake and router registration instead of serially after it.
  - The listener only reads `conversation_id` from `tunnel_session` for routing; `tube_id` is used only for the thread name and log context, so the temp-UUID-to-real-tube-id swap after `create_tube` is race-free (the gateway doesn't emit messages until it receives our offer, which happens after the swap).
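
A self-contained toy showing why the reordering helps; the real `start_websocket_listener` and `create_tube` live in the tunnel code and take different arguments, so both are stubbed here with sleeps.

```python
# Toy demo of the overlap; 0.4s/0.5s stand in for the TLS handshake and tube creation.
import threading
import time
import uuid


def start_websocket_listener(session: dict, tube_id: str) -> threading.Thread:
    """Stub listener: 'connect and register with the router' in the background."""
    def _run():
        time.sleep(0.4)                      # pretend TLS handshake + router registration
        session["listener_ready"] = True
    thread = threading.Thread(target=_run, name=f"ws-listener-{tube_id}", daemon=True)
    thread.start()
    return thread


def create_tube(session: dict) -> str:
    """Stub for the blocking Rust tube creation (~500ms)."""
    time.sleep(0.5)
    return "tube-" + uuid.uuid4().hex[:8]


session = {"conversation_id": "conv-123", "listener_ready": False}
temp_tube_id = str(uuid.uuid4())             # placeholder, only used for thread name/logs

listener = start_websocket_listener(session, temp_tube_id)   # started first now
session["tube_id"] = create_tube(session)                    # overlaps with the listener
listener.join()
# Wall time is roughly max(0.4, 0.5) instead of 0.4 + 0.5 when run serially, and
# swapping in the real tube_id here is safe because the gateway stays quiet until
# it has received our offer.
```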

Gateway-offer retry with adaptive backend-delay catch-up (sketched after this list):

  - A unified retry loop (`PAM_GATEWAY_OFFER_MAX_ATTEMPTS`, default 2) wraps `router_send_action_to_gateway` for both the streaming and non-streaming paths. A local helper `_send_gateway_offer_with_retry` replaces two near-identical inline call sites.
  - On a first-attempt failure that looks transient (`timeout`, `rrc_timeout`, `bad_state`, 502/503/504, `controller_down`), we sleep `offer_retry_extra_delay_sec() + (legacy_backend_delay - fast_backend_delay)` before the retry, so the cumulative wait matches the pre-change legacy 2.0s behavior for the cold-router case. The fast path stays fast; an unlucky launch still gets the full safety window on the retry.
  - New checkpoints `gateway_offer_backend_catchup_delay_{start,done}` and `gateway_offer_http_attempt_{N}` make the retry path visible in `PAM_CONNECT_TIMING=1` output.
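
A hypothetical sketch of the retry shape, assuming the defaults quoted above (legacy 2.0s vs fast 0.30s backend delay, 1.25s retry delay); the real `_send_gateway_offer_with_retry` wraps `router_send_action_to_gateway` and emits checkpoints through the PAM timing machinery, so its signature differs.

```python
# Sketch only -- argument shapes and the checkpoint hook are stand-ins.
import os
import time

TRANSIENT_CODES = {"timeout", "rrc_timeout", "bad_state", "controller_down"}
TRANSIENT_HTTP = {502, 503, 504}
LEGACY_BACKEND_DELAY = 2.0   # old WEBSOCKET_BACKEND_DELAY default
FAST_BACKEND_DELAY = 0.30    # new default


def offer_retry_extra_delay_sec() -> float:
    return 1.25   # default per this PR; the real helper is env-tunable


def _looks_transient(error_code, http_status) -> bool:
    return error_code in TRANSIENT_CODES or http_status in TRANSIENT_HTTP


def send_gateway_offer_with_retry(send_offer, checkpoint=print) -> bool:
    """send_offer() -> (ok, error_code, http_status); checkpoint() records timing marks."""
    max_attempts = int(os.environ.get("PAM_GATEWAY_OFFER_MAX_ATTEMPTS", "2"))
    for attempt in range(1, max_attempts + 1):
        checkpoint(f"gateway_offer_http_attempt_{attempt}")
        ok, error_code, http_status = send_offer()
        if ok:
            return True
        if attempt == max_attempts or not _looks_transient(error_code, http_status):
            return False
        # Catch-up so the cumulative wait matches the legacy cold-router window:
        # 1.25s retry delay + (2.0s - 0.30s) backend-delay delta = 2.95s.
        checkpoint("gateway_offer_backend_catchup_delay_start")
        time.sleep(offer_retry_extra_delay_sec()
                   + (LEGACY_BACKEND_DELAY - FAST_BACKEND_DELAY))
        checkpoint("gateway_offer_backend_catchup_delay_done")
    return False
```

With the defaults, a transient first-attempt failure therefore waits 2.95s before the second attempt, which matches the gateway-offline QA measurement below.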

The WebSocket listener checkpoint is renamed `websocket_listener_started` → `websocket_listener_started_early` to reflect its new position in the flow.

Verified in QA: happy path ~10.3s (was ~12.4s); the gateway-offline retry case exercises the full adaptive 2.95s catch-up (1.25s retry delay + 1.7s backend-delay delta) before the re-attempt, exactly as designed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
idimov-keeper merged commit e9bfce1 into release on Apr 22, 2026
4 checks passed
idimov-keeper deleted the pam-launch-faster-tunnel-open branch on April 22, 2026 at 22:57
sk-keeper pushed a commit that referenced this pull request Apr 24, 2026
perf(pam launch): shrink tunnel-open sleeps and parallelize WebSocket listener (#1982)
