fix(connect): fail fast when gateway is down#4022
Conversation
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughConnect readiness now treats ChangesGateway Unavailability Detection and Fast Failure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/cli.test.ts (1)
3916-3919: ⚡ Quick winAssert the gateway-start recovery step too.
This only checks the
nemoclaw onboardfallback. The new recovery guidance also promises the explicitopenshell gateway start --name nemoclawrestart path, so this regression test won't catch that instruction disappearing.Suggested assertion
expect(r.code).toBe(1); expect(r.out).toContain("OpenShell gateway is not running or unreachable"); + expect(r.out).toContain("openshell gateway start --name nemoclaw"); expect(r.out).toContain("nemoclaw onboard"); expect(r.out).not.toContain("Timed out after 1s");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/cli.test.ts` around lines 3916 - 3919, The test currently asserts the fallback "nemoclaw onboard" but misses verifying the explicit restart guidance; update the assertions in test/cli.test.ts (the block that checks r.code and r.out) to also expect the string "openshell gateway start --name nemoclaw" (or the exact restart instruction emitted by the gateway recovery logic) to ensure the new recovery guidance is present alongside "nemoclaw onboard" and still assert that "Timed out after 1s" is not present.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/connect.ts`:
- Around line 175-186: isBlockingGatewayLifecycle() is missing the same fatal
gateway-unavailable patterns used by outputShowsGatewayUnavailable(), so
lifecycle states like "client error (Connect)" and "tcp connect error" don't
trigger blocking behavior; update the regex in isBlockingGatewayLifecycle (and
the similar check around the other occurrence referenced at the second location)
to include the same phrases used by outputShowsGatewayUnavailable()—specifically
add patterns for "client error \\(Connect\\)" and "tcp connect error" (and any
other platform-specific variants present in outputShowsGatewayUnavailable()) so
failIfGatewayBlocksConnectReadiness() will fire consistently when those messages
appear.
---
Nitpick comments:
In `@test/cli.test.ts`:
- Around line 3916-3919: The test currently asserts the fallback "nemoclaw
onboard" but misses verifying the explicit restart guidance; update the
assertions in test/cli.test.ts (the block that checks r.code and r.out) to also
expect the string "openshell gateway start --name nemoclaw" (or the exact
restart instruction emitted by the gateway recovery logic) to ensure the new
recovery guidance is present alongside "nemoclaw onboard" and still assert that
"Timed out after 1s" is not present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 324baf8e-f686-4a8b-896c-46f587e25227
📒 Files selected for processing (2)
src/lib/actions/sandbox/connect.tstest/cli.test.ts
PR Review AdvisorRecommendation: blocked This is an automated advisory review. A human maintainer must make the final merge decision. Limitations: Review is based on the trusted deterministic context and provided diff; no scripts, package-manager commands, tests, or E2E workflows were executed by this advisor.; The linked issue body/comments and PR text are treated as untrusted evidence; acceptance mapping quotes them literally but does not rely on their instructions.; A sandbox-operations-e2e job was reported as auto-dispatched for the current head SHA, but no passing result for 5e63328 was included.; The selective E2E result comment references the previous target ref 27d2d73 and reports the requested job as cancelled, so it cannot satisfy current-head E2E.; Diff review was limited to the provided changed files and trusted metadata; no independent repository-wide dynamic validation was performed. Full advisor summaryPR Review AdvisorBase: Do not merge yet: GitHub mergeability is blocked, required sandbox E2E has not been shown passing for this head SHA, and the already-large sandbox connect action grew by +74 lines despite the matcher fix. Gate status
🔴 Blockers
🟡 Warnings
🔵 Suggestions
Acceptance coverage
Security review
Test / E2E status
✅ What looks good
Review completeness
|
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26260944001
|
Selective E2E Results — ❌ Some jobs failedRun: 26261277381
|
Selective E2E Results — ❌ Some jobs failedRun: 26261691324
|
## Summary Refreshes the NemoClaw docs for the v0.0.49 hardening release, including release notes, command reference updates, troubleshooting guidance, version metadata, and regenerated user skills. ## Changes - #3796, #3854, #3863, #3866, #3984, #4001, #4011, #4013, #4020, #4022, #4023, #4060, #4062 -> `docs/about/release-notes.mdx`: Adds the v0.0.49 hardening release summary covering gateway reliability, status/doctor/shields and debug UX, OpenClaw compatibility, messaging channel teardown, Hermes policy scoping, snapshots, source installs and Docker group security note, GPU preflight, CLI usage, E2E, and CI improvements. - #3796 -> `docs/manage-sandboxes/backup-restore.mdx` and `docs/reference/commands.mdx`: Documents `snapshot restore --to` overwrite protection and the `--force` opt-in. - #3863, #4013, #4020, #4023 -> `docs/reference/commands.mdx`: Documents missing channel argument usage, sandbox-scoped custom preset matching, session policy preset sync, and gateway failure classification (uses the real probe states from `src/lib/status-command-deps.ts`). - #4022, #4060, #4062 -> `docs/reference/troubleshooting.mdx`: Adds guidance for gateway-down `connect`, source checkout OpenShell bootstrapping, WDDM placeholder GPU names, and Jetson sandbox GPU passthrough. - Release prep -> `docs/project.json`, `docs/versions1.json`, `.agents/skills/nemoclaw-user-*`: Bumps docs metadata to 0.0.49 and refreshes generated user skills from the Fern docs. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) \`make docs\` was attempted locally but did not complete because \`npm\` returned \`403 Forbidden\` while fetching \`fern-api\` from \`registry.npmjs.org\` in the sandboxed environment. --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Released v0.0.49 with reliability and compatibility improvements including faster gateway failure diagnostics and safer snapshot restore behavior * Enhanced snapshot restore documentation with `--to` cloning and `--force` overwrite requirements * Expanded troubleshooting guides for source installs, GPU setup, and gateway recovery * Clarified Docker group access requirements and improved CLI command reference * **Chores** * Version bumped to 0.0.49 <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4078?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
nemoclaw <sandbox> connectreadiness polling when the named NemoClaw/OpenShell gateway is down or unreachablenemoclaw onboardProvisioningunknownreadiness status plus disconnected gateway lifecycleSupersedes #3853 with the same patch squashed onto current
mainto avoid the commit signature/merge-commit issues on that branch.Fixes #3821
Test Plan
git diff --checknpm test -- test/cli.test.ts -t "fails fast with gateway recovery guidance"(not run locally: dependencies are not installed in this worktree; CI will run)sandbox-operations-e2epassed on previous head; latest E2E advisor requiressandbox-operations-e2efor current head and it will need to be re-run here.Signed-off-by: Julie Yaunches jyaunches@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests