perf(pam launch): shrink tunnel-open sleeps and parallelize WebSocket listener #1982

Merged
idimov-keeper merged 1 commit into release from pam-launch-faster-tunnel-open
Apr 22, 2026
Conversation

@idimov-keeper
Contributor

Builds on the PR1 TunnelDAG caching. Measured against the post-PR1 baseline (~12.4s grand total), this brings `pam launch` down to ~10.3s (another ~2.1s saved; cumulative ~6.7s vs the pre-PR1 baseline).

Tune the hardcoded sleeps in the WebRTC tunnel-open path, then reinvest those savings in an adaptive-fallback retry, so the fast path stays fast while a launch whose first attempt fails still gets the legacy safety window on the retry.

Sleep / polling changes (all env-tunable via the new helpers in `connect_timing.py`; a sketch of the helper shape follows this list):

  - `WEBSOCKET_BACKEND_DELAY` default 2.0s → 0.30s (router/gateway conversation-registration window). Saves ~1.7s on the happy path.
  - The hardcoded `time.sleep(1)` before the offer POST is replaced with `pre_offer_delay_sec()`, default 0.0s (the preceding backend delay already covers router registration). Saves ~1.0s. Set `PAM_PRE_OFFER_LEGACY=1` to restore the 1.0s wait.
  - `PAM_OPEN_CONNECTION_DELAY` default 0.2s → 0.05s. The existing `open_handler_connection` retry loop (exponential backoff) already handles slow DataChannel readiness, so the fixed sleep was mostly redundant. Saves ~150ms.
  - WebRTC connection-state poll tick 100ms → 25ms via the new `PAM_WEBRTC_POLL_MS` env var. Cheap FFI call; tightens P99 handoff latency.
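
For illustration, a minimal sketch of what such env-tunable helpers can look like. `pre_offer_delay_sec()`, `WEBSOCKET_BACKEND_DELAY`, `PAM_PRE_OFFER_LEGACY`, and `PAM_WEBRTC_POLL_MS` are named by this PR; the other helper names and the exact wiring inside `connect_timing.py` are assumptions.

```python
# Sketch only -- the real connect_timing.py may wire these differently.
import os


def _env_float(name: str, default: float) -> float:
    """Return a float override from the environment, or the default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default


def websocket_backend_delay_sec() -> float:      # helper name is a guess
    # Router/gateway conversation-registration window: was 2.0s, now 0.30s.
    return _env_float("WEBSOCKET_BACKEND_DELAY", 0.30)


def pre_offer_delay_sec() -> float:
    # Replaces the hardcoded time.sleep(1) before the offer POST.
    # PAM_PRE_OFFER_LEGACY=1 restores the old 1.0s wait.
    return 1.0 if os.environ.get("PAM_PRE_OFFER_LEGACY") == "1" else 0.0


def webrtc_poll_interval_sec() -> float:         # helper name is a guess
    # Connection-state poll tick, expressed in milliseconds in the env var.
    return _env_float("PAM_WEBRTC_POLL_MS", 25.0) / 1000.0
```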

Parallelize the WebSocket listener with tube creation (see the sketch after this list):

  - `start_websocket_listener` is now called *before* `create_tube`, right after `signal_handler` is wired to `tunnel_session`. The Rust tube creation (~500ms) runs in parallel with the WebSocket TLS handshake and router registration instead of serially after it.
  - The listener only reads `conversation_id` from `tunnel_session` for routing; `tube_id` is used only for the thread name and log context, so the temp-UUID-to-real-tube-id swap after `create_tube` is race-free (the gateway doesn't emit messages until it receives our offer, which happens after the swap).
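
A self-contained toy showing why the reordering helps; the real `start_websocket_listener` and `create_tube` live in the tunnel code and take different arguments, so both are stubbed here with sleeps.

```python
# Toy demo of the overlap; 0.4s/0.5s stand in for the TLS handshake and tube creation.
import threading
import time
import uuid


def start_websocket_listener(session: dict, tube_id: str) -> threading.Thread:
    """Stub listener: 'connect and register with the router' in the background."""
    def _run():
        time.sleep(0.4)                      # pretend TLS handshake + router registration
        session["listener_ready"] = True
    thread = threading.Thread(target=_run, name=f"ws-listener-{tube_id}", daemon=True)
    thread.start()
    return thread


def create_tube(session: dict) -> str:
    """Stub for the blocking Rust tube creation (~500ms)."""
    time.sleep(0.5)
    return "tube-" + uuid.uuid4().hex[:8]


session = {"conversation_id": "conv-123", "listener_ready": False}
temp_tube_id = str(uuid.uuid4())             # placeholder, only used for thread name/logs

listener = start_websocket_listener(session, temp_tube_id)   # started first now
session["tube_id"] = create_tube(session)                    # overlaps with the listener
listener.join()
# Wall time is roughly max(0.4, 0.5) instead of 0.4 + 0.5 when run serially, and
# swapping in the real tube_id here is safe because the gateway stays quiet until
# it has received our offer.
```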

Gateway-offer retry with adaptive backend-delay catch-up (sketched after this list):

  - A unified retry loop (`PAM_GATEWAY_OFFER_MAX_ATTEMPTS`, default 2) wraps `router_send_action_to_gateway` for both the streaming and non-streaming paths. A local helper `_send_gateway_offer_with_retry` replaces two near-identical inline call sites.
  - On a first-attempt failure that looks transient (`timeout`, `rrc_timeout`, `bad_state`, 502/503/504, `controller_down`), we sleep `offer_retry_extra_delay_sec() + (legacy_backend_delay - fast_backend_delay)` before the retry, so the cumulative wait matches the pre-change legacy 2.0s behavior for the cold-router case. The fast path stays fast; an unlucky launch still gets the full safety window on the retry.
  - New checkpoints `gateway_offer_backend_catchup_delay_{start,done}` and `gateway_offer_http_attempt_{N}` make the retry path visible in `PAM_CONNECT_TIMING=1` output.
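
A hypothetical sketch of the retry shape, assuming the defaults quoted above (legacy 2.0s vs fast 0.30s backend delay, 1.25s retry delay); the real `_send_gateway_offer_with_retry` wraps `router_send_action_to_gateway` and emits checkpoints through the PAM timing machinery, so its signature differs.

```python
# Sketch only -- argument shapes and the checkpoint hook are stand-ins.
import os
import time

TRANSIENT_CODES = {"timeout", "rrc_timeout", "bad_state", "controller_down"}
TRANSIENT_HTTP = {502, 503, 504}
LEGACY_BACKEND_DELAY = 2.0   # old WEBSOCKET_BACKEND_DELAY default
FAST_BACKEND_DELAY = 0.30    # new default


def offer_retry_extra_delay_sec() -> float:
    return 1.25   # default per this PR; the real helper is env-tunable


def _looks_transient(error_code, http_status) -> bool:
    return error_code in TRANSIENT_CODES or http_status in TRANSIENT_HTTP


def send_gateway_offer_with_retry(send_offer, checkpoint=print) -> bool:
    """send_offer() -> (ok, error_code, http_status); checkpoint() records timing marks."""
    max_attempts = int(os.environ.get("PAM_GATEWAY_OFFER_MAX_ATTEMPTS", "2"))
    for attempt in range(1, max_attempts + 1):
        checkpoint(f"gateway_offer_http_attempt_{attempt}")
        ok, error_code, http_status = send_offer()
        if ok:
            return True
        if attempt == max_attempts or not _looks_transient(error_code, http_status):
            return False
        # Catch-up so the cumulative wait matches the legacy cold-router window:
        # 1.25s retry delay + (2.0s - 0.30s) backend-delay delta = 2.95s.
        checkpoint("gateway_offer_backend_catchup_delay_start")
        time.sleep(offer_retry_extra_delay_sec()
                   + (LEGACY_BACKEND_DELAY - FAST_BACKEND_DELAY))
        checkpoint("gateway_offer_backend_catchup_delay_done")
    return False
```

With the defaults, a transient first-attempt failure therefore waits 2.95s before the second attempt, which matches the gateway-offline QA measurement below.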

The WebSocket listener checkpoint is renamed `websocket_listener_started` → `websocket_listener_started_early` to reflect its new position in the flow.

Verified in QA: happy path ~10.3s (was ~12.4s); the gateway-offline retry case exercises the full adaptive 2.95s catch-up (1.25s retry delay + 1.7s backend-delay delta) before the re-attempt, exactly as designed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
idimov-keeper merged commit e9bfce1 into release on Apr 22, 2026
4 checks passed
idimov-keeper deleted the pam-launch-faster-tunnel-open branch on April 22, 2026 at 22:57
sk-keeper pushed a commit that referenced this pull request Apr 24, 2026
perf(pam launch): shrink tunnel-open sleeps and parallelize WebSocket listener (#1982)
