Skip to content

fix(connect): fail fast when gateway is down#3853

Open
chengjiew wants to merge 2 commits into
mainfrom
fix/3821_connect-gateway-down-guidance
Open

fix(connect): fail fast when gateway is down#3853
chengjiew wants to merge 2 commits into
mainfrom
fix/3821_connect-gateway-down-guidance

Conversation

@chengjiew
Copy link
Copy Markdown
Contributor

@chengjiew chengjiew commented May 20, 2026

Summary

  • fail fast during nemoclaw <sandbox> connect readiness polling when the named NemoClaw/OpenShell gateway is down or unreachable
  • include recovery guidance to restart the named gateway or rerun nemoclaw onboard
  • preserve stuck-sandbox timeout behavior when readiness has a concrete non-terminal phase such as Provisioning
  • add a regression test for unknown readiness status plus disconnected gateway lifecycle

Fixes #3821

Test Plan

  • npm run build:cli
  • npm run typecheck:cli
  • npm test -- test/cli.test.ts -t "fails fast with gateway recovery guidance"
  • npm test -- test/sandbox-stuck-recovery.test.ts
  • npm test -- test/cli.test.ts -t "connect|gateway metadata exists|gateway is no longer configured|gateway recovery guidance|sandbox readiness"
  • git diff --check

Summary by CodeRabbit

Bug Fixes

  • The connect command now detects gateway unavailability immediately and fails fast with recovery guidance, instead of waiting for a timeout.

Review Change Stack

Signed-off-by: Chengjie Wang chengjiew@nvidia.com

Signed-off-by: Chengjie Wang <chengjiew@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

📝 Walkthrough

Walkthrough

When the OpenShell gateway becomes disconnected or unreachable, nemoclaw connect now detects this condition during readiness probing and fails immediately with actionable recovery guidance rather than timing out after 120 seconds. The sandbox readiness polling logic was refactored to capture and branch on sandbox list command success/failure and check gateway lifecycle state blocking conditions.

Changes

Gateway Availability Detection and Recovery

Layer / File(s) Summary
Gateway readiness contracts and detection helpers
src/lib/actions/sandbox/connect.ts (lines 36–60, 175–223)
Imported gateway lifecycle state query and DNS proxy utilities. Defined SandboxListProbe type to track both sandbox list command status and output. Added helpers to detect blocking gateway lifecycle states, emit "gateway unavailable" errors with recovery instructions, and match gateway-unavailable patterns from command output via regex.
Initial readiness check with gateway failure handling
src/lib/actions/sandbox/connect.ts (lines 704–724)
Refactored initial sandbox readiness probe to use runSandboxList() returning { status, output }. When the list command fails and output indicates gateway unavailability, the code now terminates immediately via the gateway-unavailable error path. The "gateway blocks readiness" check is now conditional on the list command succeeding and status being unknown.
Polling loop probe with gateway failure handling
src/lib/actions/sandbox/connect.ts (lines 748–765)
Updated readiness polling loop to re-probe sandbox list as { status, output } each iteration. Applied the same gateway-unavailable failure handling when probes fail and output matches disconnection/connection-refused patterns. The unknown-status "gateway blocks readiness" check is now gated on probe success.
Test for gateway disconnection scenario
test/cli.test.ts (lines 3763–3851)
New test verifying that when gateway status is stubbed as Disconnected, nemoclaw <sandbox> connect exits with code 1, includes recovery guidance mentioning nemoclaw onboard, avoids the timeout path, and performs readiness status probing without invoking sandbox connect.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

NemoClaw CLI, fix, Sandbox, bug

Suggested reviewers

  • jyaunches
  • cv

Poem

🐰 When gateways fade and threads grow long,
A rabbit's fix rights what went wrong.
No more the hundred-twenty-second wait—
Just quick diagnosis, recovery state!
"Onboard again!" the message sings,
And connectivity spreads its wings.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(connect): fail fast when gateway is down' directly summarizes the main change in the PR—adding fast-fail detection for when the OpenShell gateway is unavailable during connect readiness polling.
Linked Issues check ✅ Passed The PR implements the fast-fail recovery pathway from issue #3821 by detecting gateway unavailability in connect polling and failing immediately with actionable recovery guidance, fully addressing the linked issue's objectives [#3821].
Out of Scope Changes check ✅ Passed All changes (connect.ts refactoring and new CLI test) are directly scoped to detecting gateway unavailability and failing fast with recovery guidance, with no unrelated modifications present.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3821_connect-gateway-down-guidance

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

E2E Advisor Recommendation

Required E2E: sandbox-operations-e2e
Optional E2E: sandbox-survival-e2e, issue-2478-crash-loop-recovery-e2e

Dispatch hint: sandbox-operations-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • sandbox-operations-e2e (high (~60 min timeout; creates multiple live sandboxes)): Best existing live coverage for NemoClaw sandbox operations and gateway lifecycle behavior. It validates sandbox listing/status/logs, connect/chat-adjacent sandbox access, multi-sandbox behavior, and explicit gateway lifecycle responses after gateway disruption, which are the closest existing E2E risks touched by connect readiness changes.

Optional E2E

  • sandbox-survival-e2e (medium-high (~30 min timeout; live onboard + gateway restart)): Useful adjacent confidence for gateway stop/start recovery and post-restart sandbox usability. It exercises NemoClaw status-driven recovery and verifies no re-onboard is needed after gateway restart, but it does not directly assert the new connect fail-fast guidance.
  • issue-2478-crash-loop-recovery-e2e (medium-high (~30 min timeout; repeated crash/recovery loop)): Optional soak coverage for nemoclaw <name> connect --probe-only recovery after sandbox gateway process crashes. The changed code is mainly the non-probe readiness path, so this is adjacent rather than merge-blocking.

New E2E recommendations

  • connect readiness negative path / disconnected OpenShell gateway (high): No existing E2E appears to directly force openshell sandbox list or named gateway lifecycle into a disconnected/unreachable state while running real nemoclaw <sandbox> connect, then assert fast failure with openshell gateway start --name nemoclaw / nemoclaw onboard guidance and no SSH attempt. The added unit test stubs this, but live OpenShell output/state can differ.
    • Suggested test: Add a regression E2E that onboards or registers a test sandbox, stops/destroys the named OpenShell gateway or injects a disconnected gateway state, runs NEMOCLAW_CONNECT_TIMEOUT=1 nemoclaw <sandbox> connect, and asserts it exits quickly with gateway recovery guidance rather than timing out waiting for readiness.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: sandbox-operations-e2e

@chengjiew chengjiew added the v0.0.47 Release target label May 20, 2026
@wscurran wscurran added the fix label May 20, 2026
@wscurran
Copy link
Copy Markdown
Contributor

@wscurran wscurran added NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). OpenShell Support for OpenShell, a safe, private runtime for autonomous AI agents labels May 20, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Review Advisor

Recommendation: blocked
Confidence: high
Analyzed HEAD: 2025c589be8f336776c196d3082315ee249bcf4e
Findings: 3 blocker(s), 2 warning(s), 0 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations: Review used provided trusted metadata and read-only repository inspection; no tests, package-manager commands, scripts, or workflows were executed.; CI and E2E results were evaluated only from the provided GitHub/status context; several checks were still IN_PROGRESS.; No live OpenShell gateway or real sandbox was available to validate docker-kill behavior.; Issue and PR text were treated as untrusted evidence and only used for acceptance mapping.

Workflow run

Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: 2025c589be8f336776c196d3082315ee249bcf4e
Recommendation: blocked
Confidence: high

The connect behavior change is directionally correct and has a focused unit regression, but merge is blocked by pending CI/mergeStateStatus=BLOCKED, missing required sandbox E2E evidence for this head SHA, and monolith growth in connect.ts.

Gate status

  • CI: pending — For head SHA 2025c58, multiple status contexts are still IN_PROGRESS, including cli-parity, E2E recommendation, wsl-e2e, macos-e2e, PR review advisor, CodeQL jobs, checks, and ShellCheck SARIF.
  • Mergeability: fail — GitHub GraphQL reports mergeStateStatus=BLOCKED for PR fix(connect): fail fast when gateway is down #3853.
  • Review threads: pass — GraphQL reviewThreads.nodes is empty and REST review_comments=0. CodeRabbit has a summary/pre-merge comment but no line review comments in the provided metadata.
  • Risky code tested: warning — A unit regression was added in test/cli.test.ts, but the modified runtime/sandbox connect readiness path requires real sandbox E2E validation per the E2E Advisor.

🔴 Blockers

  • Hard gate blocked: CI is still pending and merge state is BLOCKED: The PR cannot be considered ready while required checks are still in progress and GitHub reports the PR as blocked.
    • Recommendation: Wait for all required status contexts to complete successfully for head SHA 2025c58 and resolve the blocked merge state before merge consideration.
    • Evidence: Status rollup shows IN_PROGRESS checks including cli-parity, E2E recommendation, wsl-e2e, macos-e2e, CodeQL, checks, and ShellCheck SARIF; mergeStateStatus=BLOCKED.
  • Required sandbox E2E evidence is missing for the current head SHA (src/lib/actions/sandbox/connect.ts:704): The changed code is in the runtime sandbox connect/readiness path. The E2E Advisor requires sandbox-operations-e2e because the PR changes real OpenShell sandbox readiness behavior, but no passing required E2E result for the current head SHA was provided.
    • Recommendation: Obtain a passing sandbox-operations-e2e result for head SHA 2025c58. Consider adding the Advisor-suggested targeted disconnected-gateway connect E2E if existing jobs do not exercise this exact failure mode.
    • Evidence: E2E Advisor comment: Required E2E: sandbox-operations-e2e; provided status rollup shows E2E recommendation still IN_PROGRESS and no passed sandbox-operations-e2e result for this head SHA.
  • connect.ts monolith grew past the enforced delta threshold (src/lib/actions/sandbox/connect.ts): Trusted monolith analysis reports src/lib/actions/sandbox/connect.ts grew from 752 to 830 lines, a +78 line increase. The repo policy flags current monolith growth by 20 or more lines as a blocker.
    • Recommendation: Extract the new gateway-readiness classification and error formatting helpers into a focused module or otherwise offset the growth before merge.
    • Evidence: monolithDeltas: file=src/lib/actions/sandbox/connect.ts, baseLines=752, headLines=830, delta=78, severity=blocker.

🟡 Warnings

  • Unit coverage exercises the status-disconnected path but not all new gateway-unavailable branches (test/cli.test.ts:3763): The added regression covers a successful sandbox list returning unknown plus openshell status showing Disconnected. The new code also has a separate branch for failed sandbox list output matching Connection refused, No gateway configured, client error (Connect), or tcp connect error.
    • Recommendation: Add focused unit coverage for a non-zero openshell sandbox list result whose output indicates gateway unavailability, and verify it fails fast with recovery guidance without waiting or invoking sandbox connect.
    • Evidence: connect.ts adds outputShowsGatewayUnavailable() and listCommandFailed/pollCommandFailed branches; the new test stubs sandbox list with exit 0 and output 'alpha unknown 103s ago'.
  • Active PR overlap on the large CLI test file increases integration risk (test/cli.test.ts): Multiple open PRs modify test/cli.test.ts. This PR's production change still exists and appears relevant, but the shared monolithic test file is actively changing and may create merge or behavior drift.

🔵 Suggestions

  • None.

Acceptance coverage

  • partial — Issue [Nemoclaw][All Platforms]  nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance #3821 title: "[Nemoclaw][All Platforms]  nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance": The diff implements clear recovery guidance and fail-fast behavior for the tested disconnected-gateway path, but current-head E2E evidence for a real gateway docker kill is missing.
  • partial — Description: "After forcibly killing the OpenShell gateway container (openshell-cluster…) with docker kill, nemoclaw connect for an existing sandbox waits the full 120s connect timeout and then fails with Status: unknown and a generic timeout message": The new unit test asserts output does not contain 'Timed out after 1s' and code exits 1 quickly; however it uses a stubbed openshell script rather than a real killed gateway container.
  • met — Description: "instead of either auto‑recovering the gateway or immediately reporting a clear “gateway is down, here’s how to recover” error.": connect.ts adds failConnectReadinessGatewayUnavailable(), which prints 'OpenShell gateway is not running or unreachable' and recovery steps; the PR chooses the clear-error alternative rather than auto-recovery.
  • met — Description: "There is no explicit guidance to re‑run nemoclaw onboard or restart the gateway container": connect.ts now prints 'Run: openshell gateway start --name nemoclaw', 'run: nemoclaw onboard', and 'Retry: nemoclaw connect'. The unit test asserts output contains 'nemoclaw onboard'.
  • unknown — Expected Result 1: "nemoclaw status initially reports healthy (before gateway kill).": This is a precondition/real-environment condition and is not established by the stubbed unit test or the diff.
  • partial — Expected Result 2–4: "When you run nemoclaw prachi-new-sb connect under these conditions, one of two behaviors should occur": The diff implements behavior b) for a stubbed disconnected gateway path. Real under-these-conditions E2E evidence is still missing.
  • missing — Expected Result a): "Auto‑recovery supported:": The changed readiness path does not auto-recover during connect polling; it fails with guidance. This is acceptable only because the issue allows either a) or b).
  • missing — Expected Result a): "NemoClaw (or OpenShell) detects the missing gateway and restarts it automatically (e.g. via Docker) within the connect window.": No connect-readiness restart is added in connect.ts; failConnectReadinessGatewayUnavailable exits instead.
  • missing — Expected Result a): "The gateway becomes healthy again.": No auto-recovery-to-healthy behavior is added in this PR's connect polling path.
  • missing — Expected Result a): "nemoclaw prachi-new-sb connect succeeds: sandbox becomes ready and you get a shell inside the sandbox.": For the gateway-down scenario, the new code intentionally exits non-zero with guidance rather than succeeding.
  • unknown — Expected Result a): "Inside the sandbox, running: openclaw agent --agent main -m "hello" --session-id recovery produces a successful inference response.": No E2E or unit evidence exercises successful post-recovery shell/inference because the PR implements the fail-fast alternative.
  • met — Expected Result b): "Auto‑recovery NOT supported:": The implemented path is the non-auto-recovery alternative.
  • partial — Expected Result b): "nemoclaw prachi-new-sb connect should quickly fail with a clear error message indicating that the OpenShell gateway is down or unreachable.": Unit test verifies exit code 1, no timeout message, and output containing 'OpenShell gateway is not running or unreachable'; real E2E for killed gateway is missing.
  • met — Expected Result b): "The error output should include explicit recovery steps, for example: OpenShell gateway is not running or unreachable. Run: nemoclaw onboard to recreate the gateway, then retry your connect command.": connect.ts prints a Recovery section with openshell gateway start, nemoclaw onboard, and retry instructions; test asserts 'nemoclaw onboard'.
  • met — Expected Result b): "The command should exit non‑zero.": failConnectReadinessGatewayUnavailable() calls process.exit(1); the new test asserts r.code is 1.
  • unknown — Expected Result b): "A follow‑up nemoclaw status should reflect the degraded state or point to the missing gateway with guidance.": This PR changes connect.ts and adds a connect-focused test. It does not add or demonstrate follow-up nemoclaw status behavior.
  • partial — Actual Result key observation: "nemoclaw connect waits the full default 120 seconds before failing.": The unit test sets NEMOCLAW_CONNECT_TIMEOUT=1 and verifies the timeout message is not emitted; real 120s docker-kill behavior remains unverified by E2E.
  • met — Actual Result key observation: "The only status it shows during the failure is Status: unknown and a generic timeout message.": The new failure path emits a gateway-unreachable message before timeout for the tested status unknown plus disconnected-gateway lifecycle.
  • met — Actual Result key observation: "There is no mention that the OpenShell gateway was killed or is not running.": The new output explicitly says 'OpenShell gateway is not running or unreachable'.
  • met — Actual Result key observation: "There is no explicit recovery guidance such as “Run nemoclaw onboard to recreate the gateway” or “Restart the openshell-cluster container.”": The Recovery section includes 'openshell gateway start --name nemoclaw' and 'nemoclaw onboard'.

Security review

  • pass — 1. Secrets and Credentials: No hardcoded secrets, tokens, passwords, PEM files, credential JSON, or new credential-handling paths were introduced. Test fixtures create local temp files and stub commands only.
  • pass — 2. Input Validation and Data Sanitization: The change classifies OpenShell command output using bounded regular expressions and continues passing sandboxName as an argv element rather than interpolating it into a host shell command. No unsafe deserialization, eval, SQL, or URL parsing was added.
  • pass — 3. Authentication and Authorization: No new endpoints, authentication checks, authorization decisions, token validation, or privilege model changes were introduced.
  • pass — 4. Dependencies and Third-Party Libraries: No package manifest or dependency changes are included.
  • warning — 5. Error Handling and Logging: The new failure path may print raw OpenShell status/list output via detailOutput. This is local CLI error output and existing gateway-state paths also print diagnostics, but maintainers should ensure OpenShell status/gateway info never includes secrets before relying on this output in support logs.
  • pass — 6. Cryptography and Data Protection: Not applicable — no cryptographic operations, algorithms, key management, or transport-security settings are changed.
  • pass — 7. Configuration and Security Headers: No HTTP endpoints, CORS/CSP settings, Dockerfiles, security headers, port exposure, or runtime configuration defaults are changed.
  • warning — 8. Security Testing: A useful unit regression covers fail-closed behavior for a disconnected gateway lifecycle, but runtime/sandbox infrastructure behavior needs the required sandbox-operations-e2e for the current head SHA. The failed sandbox-list gateway-unavailable branch is not directly unit-tested.
  • warning — 9. Holistic Security Posture: The intended posture improves fail-closed behavior when the gateway is unhealthy, reducing confusing timeout behavior. However, without current-head E2E against real OpenShell gateway failure modes, there is residual risk of behavior drift in sandbox lifecycle handling.

Test / E2E status

  • Test depth: e2e_required — The production change is in src/lib/actions/sandbox/connect.ts, a runtime/sandbox readiness path that interacts with OpenShell gateway lifecycle state. The added unit test is valuable but cannot prove real gateway-down behavior after docker kill or confirm required jobs passed for the current head SHA.
  • E2E Advisor: missing
  • Required E2E jobs: sandbox-operations-e2e
  • Missing for analyzed SHA: sandbox-operations-e2e

✅ What looks good

  • The PR targets code that still exists and is directly related to the linked issue: src/lib/actions/sandbox/connect.ts readiness polling and test/cli.test.ts CLI dispatch coverage.
  • The new connect failure message is actionable and includes both gateway restart and nemoclaw onboard recovery guidance.
  • The added regression verifies non-zero exit, avoids the timeout path, and prevents invoking sandbox connect in the disconnected-gateway scenario.
  • The production change preserves argv-based OpenShell invocation for sandbox names rather than constructing host shell strings from user input.

Review completeness

  • Review used provided trusted metadata and read-only repository inspection; no tests, package-manager commands, scripts, or workflows were executed.
  • CI and E2E results were evaluated only from the provided GitHub/status context; several checks were still IN_PROGRESS.
  • No live OpenShell gateway or real sandbox was available to validate docker-kill behavior.
  • Issue and PR text were treated as untrusted evidence and only used for acceptance mapping.
  • Human maintainer review required: yes

@cv cv added v0.0.49 Release target and removed v0.0.47 Release target labels May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fix NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). OpenShell Support for OpenShell, a safe, private runtime for autonomous AI agents v0.0.49 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Nemoclaw][All Platforms]  nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance

3 participants