Skip to content

fix(gateway): keep auto-pair watcher alive for late OpenClaw CLI scope upgrades (#4263)#4292

Merged
cv merged 3 commits into
NVIDIA:mainfrom
yimoj:fix/4263-multisandbox-autopair
May 27, 2026
Merged

fix(gateway): keep auto-pair watcher alive for late OpenClaw CLI scope upgrades (#4263)#4292
cv merged 3 commits into
NVIDIA:mainfrom
yimoj:fix/4263-multisandbox-autopair

Conversation

@yimoj
Copy link
Copy Markdown
Contributor

@yimoj yimoj commented May 27, 2026

Summary

Fixes NemoClaw#4263. The in-sandbox auto-pair watcher used to exit as soon as browser or any device pairing converged (originally added in NemoClaw#2484 to avoid gateway connect-handler saturation). When an OpenClaw CLI agent later requested an additional scope, the gateway held the upgrade as pending approval forever and the agent fell back to embedded mode. On the shared multi-sandbox gateway both sandboxes hit this consistently after the watcher quiet-exit.

Related Issue

Fixes #4263.

Changes

  • scripts/nemoclaw-start.sh auto-pair watcher: replaces convergence-break with a slow-mode keepalive. After convergence (or after NEMOCLAW_AUTO_PAIR_FAST_DEADLINE_SECS, default 600s), the watcher transitions to slow polling at NEMOCLAW_AUTO_PAIR_SLOW_INTERVAL_SECS (default 30s) up to NEMOCLAW_AUTO_PAIR_DEADLINE_SECS (default 8h) and keeps approving allowlisted openclaw-control-ui / webchat / cli scope upgrades. The fast-deadline transition is evaluated before the pending-branch continue, so a sticky pending request (rejected unknown client, permanent approve failure) cannot hold the watcher at 1s polling for the full deadline — avoiding a regression of the NemoClaw#2484 connect-handler pile-up on a longer timeline.
  • src/lib/actions/sandbox/connect.ts adds a one-shot defense-in-depth approval pass before the SSH handoff. It sources /tmp/nemoclaw-proxy-env.sh, applies the same allowlist, caps at MAX_APPROVALS = 8 with an outer 12s timeout (sized to cover 2s list + 1s × MAX_APPROVALS plus shell/python startup slack), and swallows every failure so it never blocks connect. This catches the case where the watcher crashed or exhausted its deadline between sandbox start and the user's first connect.
  • Tests cover: slow-mode keepalive approval of late scope upgrades, rejection of unknown clients in slow mode, the fast-deadline fallback with no convergence signal, the new sticky-pending regression (codex-review surfaced), and the connect-time approval pass (happy path + tolerance to failure).

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npm test for the touched test files passes (test/nemoclaw-start.test.ts, test/sandbox-connect-inference.test.ts: 103/103)
  • npm run typecheck:cli passes
  • Tests added for slow-mode keepalive, fast-deadline transition, sticky-pending regression, and connect-time approval pass
  • No secrets, API keys, or credentials committed
  • Docs: not user-facing; behavior is the in-sandbox watcher and connect-time approval pass.

Test plan

  • npx vitest run test/nemoclaw-start.test.ts -t "auto-pair" — 5 tests pass
  • npx vitest run test/sandbox-connect-inference.test.ts -t "#4263" — 2 tests pass
  • npx vitest run test/nemoclaw-start.test.ts test/sandbox-connect-inference.test.ts — 103 tests pass
  • npm run typecheck:cli — clean
  • Live multi-sandbox repro on a Brev/OpenShell host (not attempted from triage workspace; the targeted tests cover the watcher state machine and connect-time exec path)

Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

  • New Features

    • Automatic device pairing for allowlisted clients during sandbox connect; approvals capped and continued in slow-mode after convergence
  • Bug Fixes

    • Approval pass is best-effort and won’t block sandbox connect if it fails
    • Per-call pairing operations are time-bounded to avoid hanging the watcher
    • Watcher timing, cadence and final log updated to reflect deadline-driven exit behavior
  • Tests

    • Added tests for slow-mode keepalive, approval-pass behavior, timeouts, retries, and regressions

Review Change Stack

…e upgrades

The in-sandbox auto-pair watcher exited as soon as browser or any
device pairing converged (NemoClaw#2484 fix). When an OpenClaw CLI
agent later requested an additional scope, the gateway held the
upgrade as pending approval forever and the agent fell back to
embedded mode (NemoClaw#4263). On the shared multi-sandbox gateway
both sandboxes hit this consistently after the watcher quiet-exit.

Instead of exiting on convergence, the watcher now transitions to a
slow-mode keepalive: it keeps polling `openclaw devices list` at a
30s cadence for up to NEMOCLAW_AUTO_PAIR_DEADLINE_SECS (default 8h)
and approves any allowlisted `openclaw-control-ui` / webchat / cli
scope upgrade that arrives. The 1s fast-mode cadence is preserved
for the first NEMOCLAW_AUTO_PAIR_FAST_DEADLINE_SECS (default 600s)
to keep first-pair latency low, and the fast-deadline transition is
evaluated before the pending-branch continue so a sticky pending
request (rejected unknown client, permanent approve failure) cannot
hold the watcher at 1s polling for the full deadline — avoiding a
regression of the NemoClaw#2484 connect-handler pile-up.

A defense-in-depth one-shot approval pass also runs in
`nemoclaw <sandbox> connect` before the SSH handoff: it sources
/tmp/nemoclaw-proxy-env.sh, applies the same allowlist, caps at 8
approvals with a 12s outer timeout (sized to cover 2s list + 1s ×
MAX_APPROVALS), and swallows every failure so it never blocks
connect. This catches the case where the watcher has crashed or
exhausted its deadline between sandbox start and the user's first
connect (e.g., long-running multi-sandbox gateway contention).

Tests cover slow-mode keepalive approval of late scope upgrades,
rejection of unknown clients in slow mode, the fast-deadline
fallback, the sticky-pending regression, and the connect-time
approval pass (both happy path and tolerance to failure).

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 76572f24-bd96-4130-8f17-d4454fc4366a

📥 Commits

Reviewing files that changed from the base of the PR and between 7276f6c and 7499fb7.

📒 Files selected for processing (2)
  • scripts/nemoclaw-start.sh
  • test/nemoclaw-start.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/nemoclaw-start.sh

📝 Walkthrough

Walkthrough

Refactors the auto-pair watcher to configurable fast/slow deadlines with per-call timeouts and slow-mode keepalive approvals; adds a best-effort, timeout-bounded in-sandbox auto-pair approval pass invoked during sandbox connect; expands tests for integration, timing, and timeout regressions.

Changes

Auto-Pair Approval and Watcher Keep-Alive

Layer / File(s) Summary
Watcher timing model and convergence control flow
scripts/nemoclaw-start.sh
Watcher preamble gains _env_seconds helper and configurable NEMOCLAW_AUTO_PAIR_FAST_DEADLINE_SECS, NEMOCLAW_AUTO_PAIR_DEADLINE_SECS, NEMOCLAW_AUTO_PAIR_SLOW_INTERVAL_SECS. Loop reorders fast→slow transition check before pending-request handling and switches convergence behavior from exiting to entering slow-mode keepalive until overall deadline.
Watcher run wrapper and slow-mode cadence
scripts/nemoclaw-start.sh
Adds run() wrapper to bound openclaw CLI calls with NEMOCLAW_AUTO_PAIR_RUN_TIMEOUT_SECS and returns a timeout sentinel; loop uses slow-mode intervals after convergence and logs "watcher deadline reached" on overall timeout.
Pending handling and convergence change
scripts/nemoclaw-start.sh
Pending-device approval loop treats timeout sentinel as retryable (do not mark handled), enables SLOW_MODE on convergence instead of breaking, and continues until DEADLINE with slow-interval sleeps.
In-sandbox auto-pair approval helper
src/lib/actions/sandbox/connect.ts
New one-shot helper executed via spawnSync inside sandbox: runs timeout-bounded openclaw devices list --json, allowlists openclaw-control-ui clients and webchat/cli modes, approves up to a capped number of unique request IDs with openclaw devices approve, and swallows errors/timeouts so connect is not blocked.
Connect flow auto-pair approval call
src/lib/actions/sandbox/connect.ts
Approval helper invoked in connectSandbox after inference route setup and before the pre-connect hint / SSH handoff so approvals are attempted but remain best-effort.
Connect approval-pass integration tests
test/sandbox-connect-inference.test.ts
Adds tests asserting sandbox connect invokes a bounded sandbox exec approval-pass script with hardened allowlist/approve steps and verifies connect still succeeds when the approval-pass exec fails.
Watcher keep-alive test suite
test/nemoclaw-start.test.ts
Updates existing auto-pair test to cap slow-mode timing and expected output; adds multiple slow-mode keepalive scenarios validating late CLI upgrades after convergence, unknown-client rejection, fast→slow fallback without convergence, sticky-pending regression, per-call timeout bounding for wedged CLI calls, and approve-retry-after-timeout regression.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4022: Modifies src/lib/actions/sandbox/connect.ts connect flow; likely related to connect/inference readiness and polling behavior.

Suggested labels

fix, Integration: OpenClaw, NemoClaw CLI, OpenShell, enhancement: testing

Suggested reviewers

  • cv
  • jyaunches
  • ericksoa

Poem

🐰 I hopped through watchers, timers tuned,
I bounded calls so none are ruined,
I whispered "approve" to trusted names,
kept slow-mode safe and tamed the games,
Now sandboxes pair without alarm. 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: fixing the auto-pair watcher to remain alive for late OpenClaw CLI scope upgrades instead of exiting prematurely after convergence.
Linked Issues check ✅ Passed The PR comprehensively addresses all objectives from issue #4263: watcher keepalive for late scope upgrades, defense-in-depth connect-time approval pass, and proper retry logic for transient timeouts.
Out of Scope Changes check ✅ Passed All changes directly support the core objective of fixing auto-pair watcher behavior and adding defense-in-depth approval logic; no extraneous modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/nemoclaw-start.sh (1)

1400-1446: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a timeout around the openclaw CLI calls.

The watcher now stays alive for hours, but both devices list and devices approve still run through an unbounded subprocess.run. One stuck CLI call will pin the watcher forever, so it never reaches either the fast→slow transition or the overall deadline this change introduces.

Suggested hardening
 def run(*args):
-    proc = subprocess.run(args, capture_output=True, text=True)
-    return proc.returncode, proc.stdout.strip(), proc.stderr.strip()
+    try:
+        proc = subprocess.run(args, capture_output=True, text=True, timeout=10)
+        return proc.returncode, proc.stdout.strip(), proc.stderr.strip()
+    except subprocess.TimeoutExpired as exc:
+        return 124, (exc.stdout or "").strip(), (exc.stderr or "").strip()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/nemoclaw-start.sh` around lines 1400 - 1446, The openclaw CLI
invocations (calls to run with OPENCLAW for 'devices list --json' and 'devices
approve <id> --json') use an unbounded subprocess.run and can hang forever; add
a reasonable timeout (e.g. a few seconds) to those calls by updating the run
invocation or its implementation to accept a timeout parameter, catch
subprocess.TimeoutExpired, kill/cleanup the child, and treat the call as a
failure (log the timeout and continue/retry) so the watcher can progress to the
FAST_DEADLINE/SLOW_MODE transitions and overall DEADLINE; update handling around
run(...) for both the list and approve branches to handle timeout errors
similarly and include short logs showing the request_id or that list timed out.
🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)

1345-1491: Please exercise the sandbox-entrypoint E2Es for this watcher change.

This code changes long-lived boot/process-lifecycle behavior in scripts/nemoclaw-start.sh, which unit tests here do not fully represent.

gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=sandbox-survival-e2e,sandbox-operations-e2e,cloud-e2e,openclaw-slack-pairing-e2e

As per coding guidelines, "scripts/nemoclaw-start.sh: This file is a sandbox entrypoint script. Changes affect every sandbox boot and are invisible to unit tests."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/nemoclaw-start.sh` around lines 1345 - 1491, The change to the
long-lived auto-pair watcher in start_auto_pair (embedded Python block using
OPENCLAW, FAST_DEADLINE, DEADLINE, SLOW_INTERVAL, etc.) affects sandbox
boot/process lifecycle and must be validated with sandbox entrypoint E2Es; run
the nightly E2E suite targeting the sandbox survival/operations, cloud, and
openclaw slack pairing jobs (use the provided gh workflow command) and verify
the watcher behavior (fast->slow transition, pending approvals, HANDLED logic,
and PID handling) across sandbox reboots and long-lived sessions, then fix any
failures (adjust timing/env vars, ensure STEP_DOWN_PREFIX_SANDBOX interaction,
and that AUTO_PAIR_PID is set/exported) before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/sandbox-connect-inference.test.ts`:
- Around line 1219-1223: The test currently never exercises the failure path
because setupFixture() makes non-probe "sandbox exec" succeed; modify the test
around runConnect(tmpDir, sandboxName) to force the approval-pass exec to fail
or hang (e.g., stub or configure the fake openshell so the specific
approval-check command returns non-zero or times out) while leaving the probe
behavior unchanged, then call runConnect(tmpDir, sandboxName) and assert
result.status is still 0 to verify "sandbox connect" proceeds despite
approval-pass failure; target the harness that simulates "sandbox exec" (the
fixture stub used by setupFixture) and tweak its behavior for the approval-pass
invocation only.

---

Outside diff comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 1400-1446: The openclaw CLI invocations (calls to run with
OPENCLAW for 'devices list --json' and 'devices approve <id> --json') use an
unbounded subprocess.run and can hang forever; add a reasonable timeout (e.g. a
few seconds) to those calls by updating the run invocation or its implementation
to accept a timeout parameter, catch subprocess.TimeoutExpired, kill/cleanup the
child, and treat the call as a failure (log the timeout and continue/retry) so
the watcher can progress to the FAST_DEADLINE/SLOW_MODE transitions and overall
DEADLINE; update handling around run(...) for both the list and approve branches
to handle timeout errors similarly and include short logs showing the request_id
or that list timed out.

---

Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 1345-1491: The change to the long-lived auto-pair watcher in
start_auto_pair (embedded Python block using OPENCLAW, FAST_DEADLINE, DEADLINE,
SLOW_INTERVAL, etc.) affects sandbox boot/process lifecycle and must be
validated with sandbox entrypoint E2Es; run the nightly E2E suite targeting the
sandbox survival/operations, cloud, and openclaw slack pairing jobs (use the
provided gh workflow command) and verify the watcher behavior (fast->slow
transition, pending approvals, HANDLED logic, and PID handling) across sandbox
reboots and long-lived sessions, then fix any failures (adjust timing/env vars,
ensure STEP_DOWN_PREFIX_SANDBOX interaction, and that AUTO_PAIR_PID is
set/exported) before merging.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 489abac7-9e0c-4bda-aadc-267b1652252e

📥 Commits

Reviewing files that changed from the base of the PR and between a784f47 and c5ef4bd.

📒 Files selected for processing (4)
  • scripts/nemoclaw-start.sh
  • src/lib/actions/sandbox/connect.ts
  • test/nemoclaw-start.test.ts
  • test/sandbox-connect-inference.test.ts

Comment thread test/sandbox-connect-inference.test.ts Outdated
…ailure

CodeRabbit feedback on NVIDIA#4292:

1. The auto-pair watcher's `run()` helper called `subprocess.run` with no
   timeout, so a wedged `openclaw devices list`/`approve` child could
   pin the watcher past the fast→slow transition and the 8h DEADLINE.
   Add a NEMOCLAW_AUTO_PAIR_RUN_TIMEOUT_SECS-bounded timeout (default
   10s) and treat the timeout as a transient failure (rc=124) so the
   watcher loop continues. New test exercises a fake openclaw that
   sleeps longer than the per-call timeout and asserts the watcher
   reaches its deadline rather than blocking on the wedged child.

2. The connect-time approval-pass tolerance test previously only
   re-checked the happy path because the fake openshell returns 0 for
   any non-probe `sandbox exec`. Add a NEMOCLAW_TEST_FAIL_APPROVAL_PASS
   hook in the fixture that makes the approval-pass exec exit non-zero
   only when its argv contains the `openclaw devices approve`
   signature, and update the tolerance test to set the hook, assert
   that the approval-pass exec was attempted, and that SSH handoff
   still happens despite the failure.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 1396-1413: The run() helper converts timeouts into return code
124, but the caller that appends the requestId to HANDLED is adding it
unconditionally; change that caller to detect the timeout sentinel (return code
124 from run()) and do not add the requestId to HANDLED in that case so
transient timeouts can be retried; specifically, in the code that calls run(...)
for the "devices approve" flow (the same scope that currently pushes requestId
into HANDLED), only add requestId to HANDLED when the returned rc is not 124
(e.g., rc == 0 or other non-timeout errors can be handled as before).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 291ddd87-8c1e-41a2-9ceb-01f9d975e6f0

📥 Commits

Reviewing files that changed from the base of the PR and between c5ef4bd and 7276f6c.

📒 Files selected for processing (3)
  • scripts/nemoclaw-start.sh
  • test/nemoclaw-start.test.ts
  • test/sandbox-connect-inference.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/sandbox-connect-inference.test.ts

Comment thread scripts/nemoclaw-start.sh
CodeRabbit feedback on NVIDIA#4292: the watcher's `run()` helper returns rc=124
on timeout, but the approve caller still added the requestId to HANDLED
unconditionally. A transient timeout would therefore strand the late
scope upgrade the watcher is supposed to recover.

Detect the rc=124 timeout sentinel and skip the HANDLED.add() in that
case so the next poll retries. Permanent failures (other non-zero rc)
still get HANDLED so we don't spin on a stuck bad request.

New regression test uses a fake openclaw whose first
`devices approve flaky-cli` call hangs past the per-call timeout and
whose second succeeds, then asserts: the timeout was logged, the retry
succeeded, and the approve log records exactly one approval.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj
Copy link
Copy Markdown
Contributor Author

yimoj commented May 27, 2026

CodeRabbit findings resolution status

For reviewer convenience — mapping each CodeRabbit inline finding from the earlier reviews to the commit that resolved it. The most recent CodeRabbit review (Run ID 76572f24-bd96-4130-8f17-d4454fc4366a, reviewing 7276f6cd..7499fb72) reported "No actionable comments were generated in the recent review."

Review Finding Resolving commit
4369346878 (c5ef4bd2) scripts/nemoclaw-start.sh:1400-1446Add a timeout around the openclaw CLI calls 7276f6cd adds RUN_TIMEOUT_SECS (default 10s, env-overridable) and wraps subprocess.run with try/except subprocess.TimeoutExpired, returning rc=124.
4369346878 (c5ef4bd2) test/sandbox-connect-inference.test.ts:1219-1223Test never exercises the failure path 7276f6cd adds the NEMOCLAW_TEST_FAIL_APPROVAL_PASS hook to the fake openshell so the approval-pass exec exits non-zero on demand, and updates the tolerance test to assert SSH handoff still proceeds.
4369403304 (7276f6cd) scripts/nemoclaw-start.sh:1396-1413Don't turn approve timeouts into permanently handled requests 7499fb72 detects rc=124 in the approve caller and skips HANDLED.add(), with a new regression test that uses a fake openclaw whose first approve hangs (timeout) and second succeeds.

CodeRabbit's overall check status is currently SUCCESS. The remaining in-progress checks are CI test/build jobs (checks, unit-vitest-linux, build-sandbox-images), no failures so far.

@yimoj yimoj added the v0.0.53 Release target label May 27, 2026
@cv cv merged commit f967876 into NVIDIA:main May 27, 2026
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v0.0.53 Release target

Projects

None yet

3 participants