perf(pam launch): shrink tunnel-open sleeps and parallelize WebSocket listener #1982
Merged
idimov-keeper merged 1 commit into release on Apr 22, 2026
Conversation
perf(pam launch): shrink tunnel-open sleeps and parallelize WebSocket listener
Builds on the PR1 TunnelDAG caching. Measured against the post-PR1
baseline (~12.4s grand total), this brings ``pam launch`` down to
~10.3s (another ~2.1s saved; cumulative ~6.7s vs pre-PR1 baseline).
Tune the hardcoded sleeps in the WebRTC tunnel open path, then use
those savings in an adaptive-fallback retry so the fast path stays
fast but the unlucky first-try-fail path still gets the legacy
safety window on the retry.
Sleep / polling changes (all env-tunable via the new helpers in
``connect_timing.py``; a sketch of the helper pattern follows this list):
- ``WEBSOCKET_BACKEND_DELAY`` default 2.0s → 0.30s (router/gateway
conversation-registration window). Saves ~1.7s on the happy path.
- Hardcoded ``time.sleep(1)`` before the offer POST is replaced with
``pre_offer_delay_sec()``, default 0.0s (the preceding backend
delay already covers router registration). Saves ~1.0s. Set
``PAM_PRE_OFFER_LEGACY=1`` to restore the 1.0s wait.
- ``PAM_OPEN_CONNECTION_DELAY`` default 0.2s → 0.05s. The existing
``open_handler_connection`` retry loop (exponential backoff) already
handles slow DataChannel readiness, so the fixed sleep was mostly
redundant. Saves ~150ms.
- WebRTC connection-state poll tick 100ms → 25ms via new
``PAM_WEBRTC_POLL_MS`` env var. Cheap FFI call; tightens P99
handoff latency.
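A minimal sketch of the helper pattern in ``connect_timing.py``, assuming a
shared float-from-env reader; ``pre_offer_delay_sec()``,
``PAM_PRE_OFFER_LEGACY``, and ``PAM_WEBRTC_POLL_MS`` are named above, while
``_env_float()`` and the exact function shapes are illustrative assumptions::

    import os

    def _env_float(name: str, default: float) -> float:
        """Read a float override from the environment; fall back on
        missing or malformed values."""
        raw = os.environ.get(name)
        if raw is None:
            return default
        try:
            return float(raw)
        except ValueError:
            return default

    def pre_offer_delay_sec() -> float:
        """Delay before the offer POST: 0.0s by default, the legacy 1.0s
        when PAM_PRE_OFFER_LEGACY=1 is set."""
        if os.environ.get("PAM_PRE_OFFER_LEGACY") == "1":
            return 1.0
        return 0.0

    def webrtc_poll_interval_sec() -> float:
        """Connection-state poll tick: 25ms by default, overridable in
        milliseconds via PAM_WEBRTC_POLL_MS."""
        return _env_float("PAM_WEBRTC_POLL_MS", 25.0) / 1000.0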
Parallelize WebSocket listener with tube creation (see the sketch after
these bullets):
- ``start_websocket_listener`` is now called *before* ``create_tube``,
right after ``signal_handler`` is wired to ``tunnel_session``. The
Rust tube creation (~500ms) runs in parallel with the WebSocket
TLS handshake and router registration instead of serially after.
- The listener only reads ``conversation_id`` from ``tunnel_session``
for routing; ``tube_id`` is used only for the thread name and log
context, so the temp-UUID-to-real-tube-id swap after ``create_tube``
is race-free (the gateway doesn't emit messages until it receives
our offer, which happens after the swap).
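A hedged sketch of the reordering; ``start_websocket_listener``,
``create_tube``, ``signal_handler``, and ``tunnel_session`` are named above,
while the threaded shape, field names, and stubs are assumptions::

    import threading
    import uuid

    def start_websocket_listener(tunnel_session):
        ...  # real listener lives elsewhere; routes by conversation_id only

    def create_tube(tunnel_session) -> str:
        ...  # Rust FFI tube creation, ~500ms

    def open_tunnel(tunnel_session, signal_handler):
        # Wire the signal handler, then start the listener *before* create_tube.
        tunnel_session.signal_handler = signal_handler

        # A temporary UUID stands in for tube_id; the listener uses it only
        # for the thread name and log context, never for routing.
        tunnel_session.tube_id = str(uuid.uuid4())
        threading.Thread(
            target=start_websocket_listener,
            args=(tunnel_session,),
            name=f"ws-listener-{tunnel_session.tube_id}",
            daemon=True,
        ).start()

        # Tube creation now overlaps the WebSocket TLS handshake and router
        # registration. Swapping in the real tube_id afterwards is race-free:
        # the gateway emits nothing until it receives our offer, which is
        # only sent after this point.
        tunnel_session.tube_id = create_tube(tunnel_session)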
Gateway-offer retry with adaptive backend-delay catch-up (see the sketch
after these bullets):
- Unified retry loop (``PAM_GATEWAY_OFFER_MAX_ATTEMPTS``, default 2)
wraps ``router_send_action_to_gateway`` for both streaming and
non-streaming paths. A local helper
``_send_gateway_offer_with_retry`` replaces two near-identical
inline call sites.
- On a first-attempt failure that looks transient (``timeout``,
``rrc_timeout``, ``bad_state``, 502/503/504, ``controller_down``),
before the retry we sleep
``offer_retry_extra_delay_sec() + (legacy_backend_delay - fast_backend_delay)``
so the cumulative wait matches the pre-change legacy 2.0s behavior
for the cold-router case. Fast path stays fast; unlucky launch
still gets the full safety window on retry.
- New checkpoints ``gateway_offer_backend_catchup_delay_{start,done}``
and ``gateway_offer_http_attempt_{N}`` make the retry path
visible in ``PAM_CONNECT_TIMING=1`` output.
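A minimal sketch of the retry helper, assuming ``send_offer`` is a closure
over ``router_send_action_to_gateway`` and ``checkpoint`` records the
``PAM_CONNECT_TIMING=1`` checkpoints; the error type, the delay constants,
and the 1.25s default (inferred from the QA numbers below) are assumptions::

    import os
    import time

    LEGACY_BACKEND_DELAY = 2.0   # pre-change WEBSOCKET_BACKEND_DELAY default
    FAST_BACKEND_DELAY = 0.30    # new default
    TRANSIENT_MARKERS = ("timeout", "rrc_timeout", "bad_state",
                         "502", "503", "504", "controller_down")

    class GatewayOfferError(Exception):
        """Placeholder error type; the real code may surface failures differently."""

    def offer_retry_extra_delay_sec() -> float:
        # 1.25s default inferred from the QA numbers; the env var name is assumed.
        return float(os.environ.get("PAM_OFFER_RETRY_EXTRA_DELAY_SEC", "1.25"))

    def _looks_transient(err: str) -> bool:
        err = err.lower()
        return any(marker in err for marker in TRANSIENT_MARKERS)

    def _send_gateway_offer_with_retry(send_offer, checkpoint):
        max_attempts = int(os.environ.get("PAM_GATEWAY_OFFER_MAX_ATTEMPTS", "2"))
        for attempt in range(1, max_attempts + 1):
            checkpoint(f"gateway_offer_http_attempt_{attempt}")
            try:
                return send_offer()
            except GatewayOfferError as exc:
                if attempt == max_attempts or not _looks_transient(str(exc)):
                    raise
                # Catch up so the cumulative wait matches the legacy 2.0s
                # window for the cold-router case: 1.25 + (2.0 - 0.30) = 2.95s.
                checkpoint("gateway_offer_backend_catchup_delay_start")
                time.sleep(offer_retry_extra_delay_sec()
                           + (LEGACY_BACKEND_DELAY - FAST_BACKEND_DELAY))
                checkpoint("gateway_offer_backend_catchup_delay_done")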
WebSocket listener checkpoint renamed
``websocket_listener_started`` → ``websocket_listener_started_early``
to reflect its new position in the flow.
Verified in QA: happy path ~10.3s (was ~12.4s); the gateway-offline retry
case exercises the full adaptive 2.95s catch-up (1.25s retry + 1.7s
backend-delay delta) before the re-attempt, exactly as designed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
sk-keeper pushed a commit that referenced this pull request on Apr 24, 2026:
perf(pam launch): shrink tunnel-open sleeps and parallelize WebSocket listener (#1982)