fix(gateway): keep auto-pair watcher alive for late OpenClaw CLI scope upgrades (#4263)#4292
Conversation
…e upgrades The in-sandbox auto-pair watcher exited as soon as browser or any device pairing converged (NemoClaw#2484 fix). When an OpenClaw CLI agent later requested an additional scope, the gateway held the upgrade as pending approval forever and the agent fell back to embedded mode (NemoClaw#4263). On the shared multi-sandbox gateway both sandboxes hit this consistently after the watcher quiet-exit. Instead of exiting on convergence, the watcher now transitions to a slow-mode keepalive: it keeps polling `openclaw devices list` at a 30s cadence for up to NEMOCLAW_AUTO_PAIR_DEADLINE_SECS (default 8h) and approves any allowlisted `openclaw-control-ui` / webchat / cli scope upgrade that arrives. The 1s fast-mode cadence is preserved for the first NEMOCLAW_AUTO_PAIR_FAST_DEADLINE_SECS (default 600s) to keep first-pair latency low, and the fast-deadline transition is evaluated before the pending-branch continue so a sticky pending request (rejected unknown client, permanent approve failure) cannot hold the watcher at 1s polling for the full deadline — avoiding a regression of the NemoClaw#2484 connect-handler pile-up. A defense-in-depth one-shot approval pass also runs in `nemoclaw <sandbox> connect` before the SSH handoff: it sources /tmp/nemoclaw-proxy-env.sh, applies the same allowlist, caps at 8 approvals with a 12s outer timeout (sized to cover 2s list + 1s × MAX_APPROVALS), and swallows every failure so it never blocks connect. This catches the case where the watcher has crashed or exhausted its deadline between sandbox start and the user's first connect (e.g., long-running multi-sandbox gateway contention). Tests cover slow-mode keepalive approval of late scope upgrades, rejection of unknown clients in slow mode, the fast-deadline fallback, the sticky-pending regression, and the connect-time approval pass (both happy path and tolerance to failure). Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughRefactors the auto-pair watcher to configurable fast/slow deadlines with per-call timeouts and slow-mode keepalive approvals; adds a best-effort, timeout-bounded in-sandbox auto-pair approval pass invoked during sandbox connect; expands tests for integration, timing, and timeout regressions. ChangesAuto-Pair Approval and Watcher Keep-Alive
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/nemoclaw-start.sh (1)
1400-1446:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd a timeout around the
openclawCLI calls.The watcher now stays alive for hours, but both
devices listanddevices approvestill run through an unboundedsubprocess.run. One stuck CLI call will pin the watcher forever, so it never reaches either the fast→slow transition or the overall deadline this change introduces.Suggested hardening
def run(*args): - proc = subprocess.run(args, capture_output=True, text=True) - return proc.returncode, proc.stdout.strip(), proc.stderr.strip() + try: + proc = subprocess.run(args, capture_output=True, text=True, timeout=10) + return proc.returncode, proc.stdout.strip(), proc.stderr.strip() + except subprocess.TimeoutExpired as exc: + return 124, (exc.stdout or "").strip(), (exc.stderr or "").strip()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/nemoclaw-start.sh` around lines 1400 - 1446, The openclaw CLI invocations (calls to run with OPENCLAW for 'devices list --json' and 'devices approve <id> --json') use an unbounded subprocess.run and can hang forever; add a reasonable timeout (e.g. a few seconds) to those calls by updating the run invocation or its implementation to accept a timeout parameter, catch subprocess.TimeoutExpired, kill/cleanup the child, and treat the call as a failure (log the timeout and continue/retry) so the watcher can progress to the FAST_DEADLINE/SLOW_MODE transitions and overall DEADLINE; update handling around run(...) for both the list and approve branches to handle timeout errors similarly and include short logs showing the request_id or that list timed out.
🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)
1345-1491: Please exercise the sandbox-entrypoint E2Es for this watcher change.This code changes long-lived boot/process-lifecycle behavior in
scripts/nemoclaw-start.sh, which unit tests here do not fully represent.gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=sandbox-survival-e2e,sandbox-operations-e2e,cloud-e2e,openclaw-slack-pairing-e2eAs per coding guidelines, "
scripts/nemoclaw-start.sh: This file is a sandbox entrypoint script. Changes affect every sandbox boot and are invisible to unit tests."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/nemoclaw-start.sh` around lines 1345 - 1491, The change to the long-lived auto-pair watcher in start_auto_pair (embedded Python block using OPENCLAW, FAST_DEADLINE, DEADLINE, SLOW_INTERVAL, etc.) affects sandbox boot/process lifecycle and must be validated with sandbox entrypoint E2Es; run the nightly E2E suite targeting the sandbox survival/operations, cloud, and openclaw slack pairing jobs (use the provided gh workflow command) and verify the watcher behavior (fast->slow transition, pending approvals, HANDLED logic, and PID handling) across sandbox reboots and long-lived sessions, then fix any failures (adjust timing/env vars, ensure STEP_DOWN_PREFIX_SANDBOX interaction, and that AUTO_PAIR_PID is set/exported) before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/sandbox-connect-inference.test.ts`:
- Around line 1219-1223: The test currently never exercises the failure path
because setupFixture() makes non-probe "sandbox exec" succeed; modify the test
around runConnect(tmpDir, sandboxName) to force the approval-pass exec to fail
or hang (e.g., stub or configure the fake openshell so the specific
approval-check command returns non-zero or times out) while leaving the probe
behavior unchanged, then call runConnect(tmpDir, sandboxName) and assert
result.status is still 0 to verify "sandbox connect" proceeds despite
approval-pass failure; target the harness that simulates "sandbox exec" (the
fixture stub used by setupFixture) and tweak its behavior for the approval-pass
invocation only.
---
Outside diff comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 1400-1446: The openclaw CLI invocations (calls to run with
OPENCLAW for 'devices list --json' and 'devices approve <id> --json') use an
unbounded subprocess.run and can hang forever; add a reasonable timeout (e.g. a
few seconds) to those calls by updating the run invocation or its implementation
to accept a timeout parameter, catch subprocess.TimeoutExpired, kill/cleanup the
child, and treat the call as a failure (log the timeout and continue/retry) so
the watcher can progress to the FAST_DEADLINE/SLOW_MODE transitions and overall
DEADLINE; update handling around run(...) for both the list and approve branches
to handle timeout errors similarly and include short logs showing the request_id
or that list timed out.
---
Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 1345-1491: The change to the long-lived auto-pair watcher in
start_auto_pair (embedded Python block using OPENCLAW, FAST_DEADLINE, DEADLINE,
SLOW_INTERVAL, etc.) affects sandbox boot/process lifecycle and must be
validated with sandbox entrypoint E2Es; run the nightly E2E suite targeting the
sandbox survival/operations, cloud, and openclaw slack pairing jobs (use the
provided gh workflow command) and verify the watcher behavior (fast->slow
transition, pending approvals, HANDLED logic, and PID handling) across sandbox
reboots and long-lived sessions, then fix any failures (adjust timing/env vars,
ensure STEP_DOWN_PREFIX_SANDBOX interaction, and that AUTO_PAIR_PID is
set/exported) before merging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 489abac7-9e0c-4bda-aadc-267b1652252e
📒 Files selected for processing (4)
scripts/nemoclaw-start.shsrc/lib/actions/sandbox/connect.tstest/nemoclaw-start.test.tstest/sandbox-connect-inference.test.ts
…ailure CodeRabbit feedback on NVIDIA#4292: 1. The auto-pair watcher's `run()` helper called `subprocess.run` with no timeout, so a wedged `openclaw devices list`/`approve` child could pin the watcher past the fast→slow transition and the 8h DEADLINE. Add a NEMOCLAW_AUTO_PAIR_RUN_TIMEOUT_SECS-bounded timeout (default 10s) and treat the timeout as a transient failure (rc=124) so the watcher loop continues. New test exercises a fake openclaw that sleeps longer than the per-call timeout and asserts the watcher reaches its deadline rather than blocking on the wedged child. 2. The connect-time approval-pass tolerance test previously only re-checked the happy path because the fake openshell returns 0 for any non-probe `sandbox exec`. Add a NEMOCLAW_TEST_FAIL_APPROVAL_PASS hook in the fixture that makes the approval-pass exec exit non-zero only when its argv contains the `openclaw devices approve` signature, and update the tolerance test to set the hook, assert that the approval-pass exec was attempted, and that SSH handoff still happens despite the failure. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 1396-1413: The run() helper converts timeouts into return code
124, but the caller that appends the requestId to HANDLED is adding it
unconditionally; change that caller to detect the timeout sentinel (return code
124 from run()) and do not add the requestId to HANDLED in that case so
transient timeouts can be retried; specifically, in the code that calls run(...)
for the "devices approve" flow (the same scope that currently pushes requestId
into HANDLED), only add requestId to HANDLED when the returned rc is not 124
(e.g., rc == 0 or other non-timeout errors can be handled as before).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 291ddd87-8c1e-41a2-9ceb-01f9d975e6f0
📒 Files selected for processing (3)
scripts/nemoclaw-start.shtest/nemoclaw-start.test.tstest/sandbox-connect-inference.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- test/sandbox-connect-inference.test.ts
CodeRabbit feedback on NVIDIA#4292: the watcher's `run()` helper returns rc=124 on timeout, but the approve caller still added the requestId to HANDLED unconditionally. A transient timeout would therefore strand the late scope upgrade the watcher is supposed to recover. Detect the rc=124 timeout sentinel and skip the HANDLED.add() in that case so the next poll retries. Permanent failures (other non-zero rc) still get HANDLED so we don't spin on a stuck bad request. New regression test uses a fake openclaw whose first `devices approve flaky-cli` call hangs past the per-call timeout and whose second succeeds, then asserts: the timeout was logged, the retry succeeded, and the approve log records exactly one approval. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
CodeRabbit findings resolution statusFor reviewer convenience — mapping each CodeRabbit inline finding from the earlier reviews to the commit that resolved it. The most recent CodeRabbit review (Run ID
CodeRabbit's overall check status is currently |
Summary
Fixes NemoClaw#4263. The in-sandbox auto-pair watcher used to exit as soon as browser or any device pairing converged (originally added in NemoClaw#2484 to avoid gateway connect-handler saturation). When an OpenClaw CLI agent later requested an additional scope, the gateway held the upgrade as pending approval forever and the agent fell back to embedded mode. On the shared multi-sandbox gateway both sandboxes hit this consistently after the watcher quiet-exit.
Related Issue
Fixes #4263.
Changes
scripts/nemoclaw-start.shauto-pair watcher: replaces convergence-break with a slow-mode keepalive. After convergence (or afterNEMOCLAW_AUTO_PAIR_FAST_DEADLINE_SECS, default 600s), the watcher transitions to slow polling atNEMOCLAW_AUTO_PAIR_SLOW_INTERVAL_SECS(default 30s) up toNEMOCLAW_AUTO_PAIR_DEADLINE_SECS(default 8h) and keeps approving allowlistedopenclaw-control-ui/webchat/cliscope upgrades. The fast-deadline transition is evaluated before the pending-branchcontinue, so a sticky pending request (rejected unknown client, permanent approve failure) cannot hold the watcher at 1s polling for the full deadline — avoiding a regression of the NemoClaw#2484 connect-handler pile-up on a longer timeline.src/lib/actions/sandbox/connect.tsadds a one-shot defense-in-depth approval pass before the SSH handoff. It sources/tmp/nemoclaw-proxy-env.sh, applies the same allowlist, caps atMAX_APPROVALS = 8with an outer12stimeout (sized to cover2s list + 1s × MAX_APPROVALSplus shell/python startup slack), and swallows every failure so it never blocks connect. This catches the case where the watcher crashed or exhausted its deadline between sandbox start and the user's first connect.Type of Change
Verification
npm testfor the touched test files passes (test/nemoclaw-start.test.ts,test/sandbox-connect-inference.test.ts: 103/103)npm run typecheck:clipassesTest plan
npx vitest run test/nemoclaw-start.test.ts -t "auto-pair"— 5 tests passnpx vitest run test/sandbox-connect-inference.test.ts -t "#4263"— 2 tests passnpx vitest run test/nemoclaw-start.test.ts test/sandbox-connect-inference.test.ts— 103 tests passnpm run typecheck:cli— cleanSigned-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Tests