fix(connect): fail fast when gateway is down#4022

Merged
jyaunches merged 2 commits into main from fix/3821-connect-gateway-down-guidance-squash
May 22, 2026

@jyaunches jyaunches commented May 22, 2026

Summary

  • fail fast during nemoclaw <sandbox> connect readiness polling when the named NemoClaw/OpenShell gateway is down or unreachable
  • include recovery guidance to restart the named gateway or rerun nemoclaw onboard
  • preserve stuck-sandbox timeout behavior when readiness has a concrete non-terminal phase such as Provisioning
  • add a regression test for unknown readiness status plus disconnected gateway lifecycle
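As a rough illustration of the behavior in the bullets above, the readiness decision could be modeled as below. This is a simplified sketch, not the code in `connect.ts`: `decideReadiness`, `ReadinessPhase`, `GatewayState`, and `Decision` are hypothetical names, and the real implementation works from OpenShell command output rather than these reduced enums.

```typescript
// Hypothetical, simplified model of the readiness decision; all names are illustrative.
type ReadinessPhase = "Ready" | "Provisioning" | "unknown";
type GatewayState = "connected" | "disconnected";
type Decision = "connect" | "keep-polling" | "fail-gateway-unavailable";

function decideReadiness(phase: ReadinessPhase, gateway: GatewayState): Decision {
  if (phase === "Ready") return "connect";
  // A concrete non-terminal phase keeps the existing stuck-sandbox timeout behavior.
  if (phase !== "unknown") return "keep-polling";
  // Unknown status: consult the gateway and fail fast if it is down.
  return gateway === "disconnected" ? "fail-gateway-unavailable" : "keep-polling";
}
```

Under this model, `Provisioning` with a disconnected gateway still polls until the timeout, which matches the third bullet.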

Supersedes #3853 with the same patch squashed onto current main to avoid the commit signature/merge-commit issues on that branch.

Fixes #3821

Test Plan

  • git diff --check
  • npm test -- test/cli.test.ts -t "fails fast with gateway recovery guidance" (not run locally: dependencies are not installed in this worktree; CI will run)
  • Evidence from prior PR #3853 (fix(connect): fail fast when gateway is down) before the squash: PR CI was green and sandbox-operations-e2e passed on the previous head. The latest E2E advisor requires sandbox-operations-e2e for the current head, so it will need to be re-run here.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

Summary by CodeRabbit

  • Bug Fixes

    • Improved sandbox connection readiness to detect OpenShell gateway unavailability earlier and fail fast with clearer, actionable error messages and recovery steps.
  • Tests

    • Added CLI test coverage ensuring connect attempts fail promptly when the gateway is reported disconnected, validating exit behavior and user guidance without entering the normal connect flow.


Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

coderabbitai Bot commented May 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f157779e-dbbb-4e31-8f0e-c2510dff27f9

📥 Commits

Reviewing files that changed from the base of the PR and between 27d2d73 and 5e63328.

📒 Files selected for processing (1)
  • src/lib/actions/sandbox/connect.ts

📝 Walkthrough


Connect readiness now treats openshell sandbox list probes as structured {status, output}, detects gateway unavailability from probe output or lifecycle state, fails fast with a recovery message (including lifecycle hint), and adds a CLI test asserting the fast-fail behavior.

Changes

Gateway Unavailability Detection and Fast Failure

| Layer | File(s) | Summary |
| --- | --- | --- |
| Imports and probe type | src/lib/actions/sandbox/connect.ts | Add gateway lifecycle inspection imports and SandboxListProbe type to carry command status and output. |
| Gateway-unavailability helpers | src/lib/actions/sandbox/connect.ts | Add regex classifier for gateway-unavailable openshell sandbox list output, predicate using getNamedGatewayLifecycleState() for blocking states, and a fatal error routine that prints recovery steps and lifecycle hint; provide wrapper to enforce fail-if-blocking. |
| Readiness probe refactor and fast-fail wiring | src/lib/actions/sandbox/connect.ts | Refactor initial probe and polling loop to use structured {status, output}; fail immediately when the list command fails with gateway-unavailability output; when parsed status is unknown, consult gateway lifecycle and fail with gateway-unavailable flow when blocking. |
| CLI dispatch test for gateway down | test/cli.test.ts | New vitest simulates a disconnected gateway via openshell stubs and asserts nemoclaw alpha connect exits 1 with gateway recovery guidance, no 1s timeout message, and no attempted sandbox connect. |
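For readers without the diff at hand, the structured probe and the output classifier described in the walkthrough might look something like the sketch below. The pattern list is taken from the phrases the PR Review Advisor quotes for `GATEWAY_UNAVAILABLE_RE`; the exact regex, the numeric `status` type, the case-insensitive flag, and the helper `probeFailedBecauseGatewayDown` are assumptions made for illustration.

```typescript
// Assumed shape of the structured probe result; the real type may carry more fields.
interface SandboxListProbe {
  status: number; // exit status of `openshell sandbox list` (assumed numeric)
  output: string; // captured command output
}

// Illustrative pattern built from the phrases quoted in review; the real regex may differ.
const GATEWAY_UNAVAILABLE_RE =
  /No gateway configured|No active gateway|Connection refused|client error \(Connect\)|tcp connect error|Status:\s*Disconnected/i;

function outputShowsGatewayUnavailable(output: string): boolean {
  return GATEWAY_UNAVAILABLE_RE.test(output);
}

// Hypothetical helper: only a failed list command with matching output counts as gateway-down.
function probeFailedBecauseGatewayDown(probe: SandboxListProbe): boolean {
  return probe.status !== 0 && outputShowsGatewayUnavailable(probe.output);
}
```

Keeping a single pattern and reusing it for both output and lifecycle classification is what prevents the two checks from drifting apart.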

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

bug, NemoClaw CLI, fix, Networking

Suggested reviewers

  • ericksoa

Poem

A rabbit hops where gateways sleep,
Sniffs the logs instead of counting sheep.
"If openshell's down," it gently cries,
"Run onboard next" — no more long tries! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(connect): fail fast when gateway is down' directly and clearly describes the main objective of the PR to fail fast when the gateway is unavailable. |
| Linked Issues check | ✅ Passed | The code changes fully implement the core coding requirements from issue #3821: detecting gateway unavailability via regex matching and lifecycle checks, failing fast with a clear 'gateway unavailable' message, and providing recovery guidance. |
| Out of Scope Changes check | ✅ Passed | All changes in the PR are directly scoped to the linked issue #3821: connect.ts implements gateway-lifecycle-aware readiness checks and cli.test.ts adds a regression test for the gateway-down scenario. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.




github-actions Bot commented May 22, 2026

E2E Advisor Recommendation

Required E2E: sandbox-operations-e2e
Optional E2E: sandbox-survival-e2e, inference-routing-e2e

Dispatch hint: sandbox-operations-e2e

Auto-dispatched E2E: sandbox-operations-e2e via nightly-e2e.yaml at 5e633285bdb22d77f58fbcd8215686f09def2725 (nightly run)

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • sandbox-operations-e2e (high (~60 min)): Directly exercises sandbox list/connect/status/logs/destroy plus gateway auto-recovery and process recovery. This is the closest existing E2E coverage for changes in the sandbox connect readiness path and gateway lifecycle handling.

Optional E2E

  • sandbox-survival-e2e (medium (~30 min)): Provides adjacent confidence for sandbox availability across gateway restarts and verifies post-restart status/SSH/inference behavior. Useful because this PR changes connect/readiness behavior around gateway unavailability, but it is less directly targeted than sandbox-operations-e2e.
  • inference-routing-e2e (medium (~30 min)): Optional adjacency only: connect.ts still gates SSH after ensuring the sandbox inference route. The diff does not primarily change inference routing, but this can catch regressions if readiness/gateway handling prevents route verification in real environments.

New E2E recommendations

  • sandbox lifecycle / gateway unavailable negative path (high): Existing E2E coverage exercises gateway recovery and sandbox operations, but there does not appear to be a real E2E negative test that forces nemoclaw <sandbox> connect readiness polling to see a disconnected/unreachable named gateway and asserts it fails fast with recovery guidance instead of timing out or attempting SSH.
    • Suggested test: Add a sandbox lifecycle E2E case, preferably under sandbox-operations or validation_suites/sandbox/lifecycle, that creates or mocks a real pending sandbox state, stops/disconnects the named nemoclaw gateway, runs nemoclaw <sandbox> connect, and asserts the new fail-fast recovery guidance plus no openshell sandbox connect attempt.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: sandbox-operations-e2e


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
test/cli.test.ts (1)

3916-3919: ⚡ Quick win

Assert the gateway-start recovery step too.

This only checks the nemoclaw onboard fallback. The new recovery guidance also promises the explicit openshell gateway start --name nemoclaw restart path, so this regression test won't catch that instruction disappearing.

Suggested assertion
       expect(r.code).toBe(1);
       expect(r.out).toContain("OpenShell gateway is not running or unreachable");
+      expect(r.out).toContain("openshell gateway start --name nemoclaw");
       expect(r.out).toContain("nemoclaw onboard");
       expect(r.out).not.toContain("Timed out after 1s");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/cli.test.ts` around lines 3916 - 3919, The test currently asserts the
fallback "nemoclaw onboard" but misses verifying the explicit restart guidance;
update the assertions in test/cli.test.ts (the block that checks r.code and
r.out) to also expect the string "openshell gateway start --name nemoclaw" (or
the exact restart instruction emitted by the gateway recovery logic) to ensure
the new recovery guidance is present alongside "nemoclaw onboard" and still
assert that "Timed out after 1s" is not present.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/actions/sandbox/connect.ts`:
- Around line 175-186: isBlockingGatewayLifecycle() is missing the same fatal
gateway-unavailable patterns used by outputShowsGatewayUnavailable(), so
lifecycle states like "client error (Connect)" and "tcp connect error" don't
trigger blocking behavior; update the regex in isBlockingGatewayLifecycle (and
the similar check around the other occurrence referenced at the second location)
to include the same phrases used by outputShowsGatewayUnavailable()—specifically
add patterns for "client error \\(Connect\\)" and "tcp connect error" (and any
other platform-specific variants present in outputShowsGatewayUnavailable()) so
failIfGatewayBlocksConnectReadiness() will fire consistently when those messages
appear.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 324baf8e-f686-4a8b-896c-46f587e25227

📥 Commits

Reviewing files that changed from the base of the PR and between 74c0246 and 27d2d73.

📒 Files selected for processing (2)
  • src/lib/actions/sandbox/connect.ts
  • test/cli.test.ts

Comment thread src/lib/actions/sandbox/connect.ts Outdated

github-actions Bot commented May 22, 2026

PR Review Advisor

Recommendation: blocked
Confidence: high
Analyzed HEAD: 5e633285bdb22d77f58fbcd8215686f09def2725
Findings: 3 blocker(s), 3 warning(s), 0 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations: Review is based on the trusted deterministic context and provided diff; no scripts, package-manager commands, tests, or E2E workflows were executed by this advisor. The linked issue body/comments and PR text are treated as untrusted evidence; acceptance mapping quotes them literally but does not rely on their instructions. A sandbox-operations-e2e job was reported as auto-dispatched for the current head SHA, but no passing result for 5e63328 was included. The selective E2E result comment references the previous target ref 27d2d73 and reports the requested job as cancelled, so it cannot satisfy current-head E2E. Diff review was limited to the provided changed files and trusted metadata; no independent repository-wide dynamic validation was performed.

Workflow run

Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: 5e633285bdb22d77f58fbcd8215686f09def2725
Recommendation: blocked
Confidence: high

Do not merge yet: GitHub mergeability is blocked, required sandbox E2E has not been shown passing for this head SHA, and the already-large sandbox connect action grew by +74 lines despite the matcher fix.

Gate status

  • CI: pass — Required contexts checks, commit-lint, dco-check, check-hash, and changes are reported successful for head SHA 5e63328. Non-required contexts still include one pending and one failed context.
  • Mergeability: fail — GitHub metadata reports mergeStateStatus=BLOCKED and reviewDecision=REVIEW_REQUIRED for head SHA 5e63328.
  • Review threads: pass — 1 review thread is present and is resolved; the CodeRabbit matcher-sync thread is marked addressed in commit 5e63328.
  • Risky code tested: pending — The changed path is sandbox/runtime connect readiness logic. Unit coverage was added, but trusted testDepth says e2e_required and the E2E Advisor required sandbox-operations-e2e for the current head; no passing sandbox-operations-e2e result for 5e63328 is included.

🔴 Blockers

  • Branch protection / mergeability remains blocked: The PR is not currently mergeable under GitHub's gate state, even though required CI contexts are green.
    • Recommendation: Resolve the blocked merge state and any required review/branch-protection requirements before considering merge readiness.
    • Evidence: GraphQL pullRequest reports mergeStateStatus=BLOCKED and reviewDecision=REVIEW_REQUIRED; trusted gateStatus.mergeability.status=fail.
  • Required sandbox E2E is not shown passing for the current head SHA (src/lib/actions/sandbox/connect.ts:620): This PR changes sandbox connect readiness and gateway lifecycle behavior. The E2E Advisor required sandbox-operations-e2e and auto-dispatched it at the current head SHA, but the provided data does not include a passing result for that required job at 5e63328.
    • Recommendation: Wait for sandbox-operations-e2e to complete successfully for head SHA 5e63328. Consider adding the Advisor-suggested real disconnected-gateway negative E2E if the existing job does not exercise that exact path.
    • Evidence: E2E Advisor comment: Required E2E: sandbox-operations-e2e; Auto-dispatched via nightly-e2e.yaml at 5e63328. No passed sandbox-operations-e2e result for this SHA is present; the later selective E2E result references old target ref 27d2d73 and reports cancelled.
  • Sandbox connect monolith grew beyond the enforced budget (src/lib/actions/sandbox/connect.ts:1): The already-large sandbox connect action grew by +74 lines, from 752 to 826 lines. Trusted monolith budget evidence marks growth of 20 or more lines as a blocker requiring extraction or offsetting before merge.
    • Recommendation: Extract gateway-unavailability classification, lifecycle gating, and recovery-message rendering into a focused helper module, or otherwise offset the growth so connect.ts does not continue expanding as a monolith.
    • Evidence: monolithDeltas: src/lib/actions/sandbox/connect.ts baseLines=752, headLines=826, delta=74, severity=blocker, rationale='Current monolith grew by 20 or more lines; extract or offset the growth before merge.'

🟡 Warnings

  • Unit test covers the main disconnected-gateway path but not all matcher variants (test/cli.test.ts:3758): The new CLI regression validates Status: Disconnected through the lifecycle/status path and confirms no sandbox connect attempt. It does not separately exercise list-command failure output variants such as client error (Connect) and tcp connect error, despite the shared regex now recognizing them.
    • Recommendation: Add focused negative tests for sandbox list command failure output and lifecycle status output containing client error (Connect) and tcp connect error, asserting fast failure and no timeout message.
    • Evidence: connect.ts defines GATEWAY_UNAVAILABLE_RE with No gateway configured, No active gateway, Connection refused, client error (Connect), tcp connect error, and Status: Disconnected. The added test only mocks openshell status with Status: Disconnected.
  • Gateway lifecycle detail output is printed directly to stderr (src/lib/actions/sandbox/connect.ts:194): The new fail-fast path prints detailOutput from OpenShell/lifecycle probes directly. This improves diagnostics, but maintainers should confirm those probe outputs cannot include sensitive environment, credential, or token material.
    • Recommendation: Confirm getNamedGatewayLifecycleState/captureOpenshell outputs are limited to non-sensitive gateway status. If they can contain secrets, redact known token/password/key patterns before printing.
    • Evidence: failConnectReadinessGatewayUnavailable() calls console.error(detailOutput.trimEnd()) and printGatewayLifecycleHint(detailOutput, sandboxName, console.error).
  • Active PR overlap on shared CLI dispatch test file (test/cli.test.ts:3758): The changed CLI test file overlaps with other active open PRs, increasing rebase and behavioral-drift risk in a central dispatch test suite.
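If maintainers decide the probe output needs scrubbing before it reaches stderr, as the second warning above suggests, a redaction pass could be as small as the sketch below. `redactSecrets` and the patterns in `REDACTIONS` are hypothetical and deliberately narrow; they are not part of this PR, and a real redactor should be driven by whatever the OpenShell tools can actually emit.

```typescript
// Hypothetical redaction rules; illustrative only, not an exhaustive secret detector.
const REDACTIONS: Array<[RegExp, string]> = [
  // key=value or key: value pairs whose key looks credential-like
  [/\b(token|password|secret|api[_-]?key)(\s*[=:]\s*)\S+/gi, "$1$2[REDACTED]"],
  // HTTP bearer credentials
  [/\bBearer\s+[A-Za-z0-9._~+\/-]+=*/g, "Bearer [REDACTED]"],
];

function redactSecrets(text: string): string {
  // Apply each rule in order, leaving non-matching diagnostic text untouched.
  return REDACTIONS.reduce((acc, [pattern, replacement]) => acc.replace(pattern, replacement), text);
}
```

A call such as `console.error(redactSecrets(detailOutput).trimEnd())` would keep the gateway status lines readable while masking obvious credentials.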

🔵 Suggestions

  • None.

Acceptance coverage

  • partial — After forcibly killing the OpenShell gateway container (openshell-cluster…) with docker kill, nemoclaw connect for an existing sandbox waits the full 120s connect timeout and then fails with Status: unknown and a generic timeout message, instead of either auto‑recovering the gateway or immediately reporting a clear “gateway is down, here’s how to recover” error.: connect.ts now fails fast when sandbox list output or gateway lifecycle indicates unavailability. The new unit test asserts exit code 1, recovery guidance, and no 'Timed out after 1s'. Coverage is partial because no passing real killed-gateway E2E is shown for the current head SHA.
  • met — There is no explicit guidance to re‑run nemoclaw onboard or restart the gateway container, despite the test expecting either automatic recovery or actionable recovery instructions in this scenario.: failConnectReadinessGatewayUnavailable() prints Recovery steps including 'openshell gateway start --name nemoclaw', '${CLI_NAME} onboard', and retrying '${CLI_NAME} ${sandboxName} connect'. The unit test asserts output contains 'nemoclaw onboard'.
  • unknown — Platform: Linux (e.g. Ubuntu 22.04 / 24.04 / 26.04): No platform-specific real Linux gateway-kill evidence is provided for this head SHA; the added regression is a mocked CLI test.
  • unknown — GPU: Any supported GPU: The code path is not GPU-specific, and the unit fixture sets gpuEnabled=false. No GPU E2E evidence is provided.
  • unknown — Docker: Installed and running (supported NemoClaw/OpenShell runtime): The unit test does not run Docker. Required sandbox E2E has not been shown passing for the current head SHA.
  • unknown — NemoClaw CLI: v0.0.45: No version-specific validation evidence is provided; the PR modifies current main.
  • unknown — Running as a Docker container, name matching openshell-cluster / openshell-cluster-nemoclaw as in the OpenShell/NemoClaw guides.: The unit test simulates OpenShell commands via a temporary shell stub and does not inspect or kill a real gateway container.
  • partial — At least one sandbox onboarded and healthy, e.g. prachi-new-sb, with a working inference provider (e.g. nvidia-prod or ollama-local).: The test creates a registry entry for sandbox alpha with provider nvidia-prod and mocks 'openshell sandbox get alpha', but it does not exercise a real onboarded healthy sandbox or working inference provider.
  • unknown — NemoClaw CLI version and OpenShell version can be taken from nemoclaw version and openshell --version output (fill in your exact versions, similar to other issues).: No version command output is captured in the diff/test evidence.
  • unknown — Ensure NemoClaw CLI is installed and Docker is running.: No real installation or Docker precondition is tested; the regression test uses mocked local binaries.
  • unknown — Ensure OpenShell gateway is running as a container:: No real OpenShell gateway container is started or checked in the provided evidence.
  • unknown — You should see a container whose name includes openshell-cluster or openshell-cluster-nemoclaw.: No docker ps evidence is provided.
  • partial — Ensure at least one sandbox is onboarded and healthy, e.g. prachi-new-sb:: The unit test creates a mocked sandbox alpha registry entry and mocked get/list responses, but does not verify a real onboarded healthy sandbox.
  • unknown — nemoclaw status should show healthy, and nemoclaw list should show prachi-new-sb with a valid model/provider.: No test or diff evidence validates 'nemoclaw status' healthy or 'nemoclaw list' output for a real sandbox.
  • unknown — Confirm overall NemoClaw health:: No real pre-kill health check is part of the test or provided E2E evidence.
  • partial — Force‑kill the OpenShell gateway container:: The unit test simulates a disconnected gateway through mocked 'openshell status' output with 'Status: Disconnected'; it does not docker kill a real gateway container.
  • unknown — Wait ~30 seconds to give any background retry logic a chance to run.: The PR intentionally implements fail-fast behavior and the unit test uses NEMOCLAW_CONNECT_TIMEOUT=1; no real 30-second wait scenario is shown.
  • met — Attempt to connect to an existing sandbox (example: prachi-new-sb):: The new test invokes 'alpha connect' against a mocked existing sandbox registry entry, exercising the connect command path.
  • met — Observe the status output and final result.: The test asserts output includes 'OpenShell gateway is not running or unreachable' and 'nemoclaw onboard', and excludes 'Timed out after 1s'.
  • met — After the command exits, capture the exit code:: The new test asserts r.code is 1.

Security review

  • pass — 1. Secrets and Credentials: No hardcoded secrets, API keys, passwords, tokens, PEMs, credential JSON, or .env files are added. The test writes only temporary registry/test fixture data under a temp HOME.
  • pass — 2. Input Validation and Data Sanitization: No new external input parser or unsafe deserialization is added. sandboxName remains passed as argv to OpenShell commands. The shared gateway-unavailable regex classifies command output only and does not introduce shell interpolation.
  • pass — 3. Authentication and Authorization: No endpoint, authentication flow, authorization decision, token validation, or privilege boundary is introduced or modified. The change affects local CLI readiness/error handling.
  • pass — 4. Dependencies and Third-Party Libraries: No package manager files or dependencies are changed.
  • warning — 5. Error Handling and Logging: The new fail-fast path improves recovery guidance, but prints detailOutput from OpenShell/lifecycle probes directly to stderr. This appears to be local diagnostic output, but maintainers should confirm it cannot include sensitive tokens, credentials, or excessive environment detail.
  • pass — 6. Cryptography and Data Protection: Not applicable — no cryptographic operations, key handling, encryption, hashing, or transport-security behavior are changed.
  • pass — 7. Configuration and Security Headers: No HTTP server, CORS/CSP, Dockerfile, container privilege, port exposure, or security-header configuration is changed.
  • warning — 8. Security Testing: A regression unit test was added for disconnected gateway guidance and the matcher-sync issue was addressed. However, this is sandbox/runtime infrastructure behavior and the E2E Advisor required sandbox-operations-e2e; a passing result for the current head SHA was not provided.
  • warning — 9. Holistic Security Posture: The change is intended to improve operator clarity when the OpenShell gateway is down and does not obviously introduce sandbox escape, SSRF bypass, policy bypass, credential leakage, blueprint tampering, installer trust, or workflow trusted-code-boundary issues. Residual reliability/security-posture risk remains until real sandbox E2E validates the gateway-down behavior.

Test / E2E status

  • Test depth: e2e_required — Runtime/sandbox/infrastructure paths need real execution coverage: src/lib/actions/sandbox/connect.ts changes connect readiness polling, OpenShell gateway lifecycle handling, and failure behavior. The added unit test is useful but mocks OpenShell and cannot prove behavior against a real killed/disconnected gateway container, real sandbox list/status outputs, or timing behavior.
  • E2E Advisor: missing
  • Required E2E jobs: sandbox-operations-e2e
  • Missing for analyzed SHA: sandbox-operations-e2e

✅ What looks good

  • The PR patches active files that still exist on the current branch; both src/lib/actions/sandbox/connect.ts and test/cli.test.ts have recent history and no rename hints.
  • The CodeRabbit matcher-sync review thread was addressed in commit 5e63328 by introducing a shared GATEWAY_UNAVAILABLE_RE used by both lifecycle and output classification.
  • The added regression test asserts non-zero exit, gateway recovery guidance, absence of the timeout message, and no sandbox connect attempt for the mocked disconnected-gateway path.
  • The fail-fast error message includes concrete recovery steps to restart the named OpenShell gateway, rerun nemoclaw onboard if needed, and retry connect.
  • The implementation preserves existing terminal sandbox-state handling and gates unknown readiness status through gateway lifecycle checks rather than broadly changing connect behavior.
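The recovery guidance mentioned above can be pictured as a small message builder. The sketch below is an assumption about shape only: `buildGatewayRecoveryMessage` is a hypothetical name, the step strings are assembled from the commands quoted in this review, and the wording printed by `connect.ts` may differ.

```typescript
// Hypothetical helper; the actual wording and layout in connect.ts may differ.
function buildGatewayRecoveryMessage(sandboxName: string, cliName = "nemoclaw"): string[] {
  return [
    "OpenShell gateway is not running or unreachable.",
    "Recovery steps:",
    "  1. openshell gateway start --name nemoclaw",
    `  2. ${cliName} onboard`,
    `  3. ${cliName} ${sandboxName} connect`,
  ];
}
```

Returning lines instead of printing directly keeps the text easy to assert on in a unit test, for example by joining the lines and checking for each command.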

Review completeness

  • Review is based on the trusted deterministic context and provided diff; no scripts, package-manager commands, tests, or E2E workflows were executed by this advisor.
  • The linked issue body/comments and PR text are treated as untrusted evidence; acceptance mapping quotes them literally but does not rely on their instructions.
  • A sandbox-operations-e2e job was reported as auto-dispatched for the current head SHA, but no passing result for 5e63328 was included.
  • The selective E2E result comment references the previous target ref 27d2d73 and reports the requested job as cancelled, so it cannot satisfy current-head E2E.
  • Diff review was limited to the provided changed files and trusted metadata; no independent repository-wide dynamic validation was performed.
  • Human maintainer review required: yes

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

Selective E2E Results — ✅ All requested jobs passed

Run: 26260944001
Target ref: 27d2d734763834f7d294aef2c99f9ddc2bb004ea
Workflow ref: main
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ⚠️ cancelled


Selective E2E Results — ❌ Some jobs failed

Run: 26261277381
Target ref: 5e633285bdb22d77f58fbcd8215686f09def2725
Workflow ref: main
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
sandbox-operations-e2e ❌ failure

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.


Selective E2E Results — ❌ Some jobs failed

Run: 26261691324
Target ref: fix/3821-connect-gateway-down-guidance-squash
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
sandbox-operations-e2e ❌ failure

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

@jyaunches jyaunches merged commit 1e62c58 into main May 22, 2026
87 of 89 checks passed
cv pushed a commit that referenced this pull request May 22, 2026
## Summary
Refreshes the NemoClaw docs for the v0.0.49 hardening release, including
release notes, command reference updates, troubleshooting guidance,
version metadata, and regenerated user skills.

## Changes
- #3796, #3854, #3863, #3866, #3984, #4001, #4011, #4013, #4020, #4022,
#4023, #4060, #4062 -> `docs/about/release-notes.mdx`: Adds the v0.0.49
hardening release summary covering gateway reliability,
status/doctor/shields and debug UX, OpenClaw compatibility, messaging
channel teardown, Hermes policy scoping, snapshots, source installs and
Docker group security note, GPU preflight, CLI usage, E2E, and CI
improvements.
- #3796 -> `docs/manage-sandboxes/backup-restore.mdx` and
`docs/reference/commands.mdx`: Documents `snapshot restore --to`
overwrite protection and the `--force` opt-in.
- #3863, #4013, #4020, #4023 -> `docs/reference/commands.mdx`: Documents
missing channel argument usage, sandbox-scoped custom preset matching,
session policy preset sync, and gateway failure classification (uses the
real probe states from `src/lib/status-command-deps.ts`).
- #4022, #4060, #4062 -> `docs/reference/troubleshooting.mdx`: Adds
guidance for gateway-down `connect`, source checkout OpenShell
bootstrapping, WDDM placeholder GPU names, and Jetson sandbox GPU
passthrough.
- Release prep -> `docs/project.json`, `docs/versions1.json`,
`.agents/skills/nemoclaw-user-*`: Bumps docs metadata to 0.0.49 and
refreshes generated user skills from the Fern docs.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

`make docs` was attempted locally but did not complete because `npm`
returned `403 Forbidden` while fetching `fern-api` from
`registry.npmjs.org` in the sandboxed environment.

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
  * Released v0.0.49 with reliability and compatibility improvements
    including faster gateway failure diagnostics and safer snapshot restore
    behavior
  * Enhanced snapshot restore documentation with `--to` cloning and
    `--force` overwrite requirements
  * Expanded troubleshooting guides for source installs, GPU setup, and
    gateway recovery
  * Clarified Docker group access requirements and improved CLI command
    reference

* **Chores**
  * Version bumped to 0.0.49

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

[Nemoclaw][All Platforms]  nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance

3 participants