fix(cli): skip stale k3s gateway container check in Docker-driver doctor (#4502)#4646
Conversation
`nemoclaw doctor` unconditionally inspected the legacy k3s gateway container `openshell-cluster-nemoclaw`, which only exists for the Kubernetes gateway driver. On the current Linux/arm64 Docker-driver gateway (host process or `nemoclaw-openshell-gateway` compat container) that inspect always fails, so doctor reported a false `[fail]` even when OpenShell status said `connected to nemoclaw` (NVIDIA#4502). Gate the legacy container inspect on the gateway driver: skip it for docker/vm drivers, run it for kubernetes, and fall back to platform detection for older registry entries that predate `openshellDriver`. The authoritative OpenShell status check already covers the Docker-driver gateway. Also de-hardcode the not-found hint to the inspected container name. Add regression tests covering Docker-driver mode (legacy container absent but gateway healthy -> no false failure) and the kubernetes driver (legacy container still inspected). Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughDoctor now only inspects legacy openshell-cluster-* Docker containers when the sandbox driver is Kubernetes or when platform-level fallback indicates the legacy Docker gateway driver; the inspection call is guarded by a new helper and tests plus an improved error hint were added. ChangesLegacy Gateway Container Inspection Fix
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/cli.test.ts (1)
2755-2788: ⚡ Quick winAssert that
docker inspectwas never invoked.This test proves the user-visible outcome, but not the core contract from the fix. If
doctorstill calleddocker inspectand then ignored that failure, these assertions would still pass. Recording fake Docker argv here and asserting noinspectcall occurs would make the regression much tighter.Suggested test hardening
+ const dockerCalls = path.join(setup.home, "docker-calls"); fs.writeFileSync( path.join(setup.localBin, "docker"), [ "#!/usr/bin/env bash", + `printf '%s\\n' "$*" >> ${JSON.stringify(dockerCalls)}`, 'if [ "$1" = "info" ]; then echo "24.0.0"; exit 0; fi', 'if [ "$1" = "inspect" ]; then echo "Error: No such object: $3" >&2; exit 1; fi', "exit 0", ].join("\n"), { mode: 0o755 }, @@ expect(report.checks.find((check) => check.label === "Docker container")).toBeUndefined(); + expect(fs.readFileSync(dockerCalls, "utf8")).not.toMatch(/\binspect\b/);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/cli.test.ts` around lines 2755 - 2788, Add an assertion that the fake docker binary was never invoked with the "inspect" subcommand by modifying the stub created at path.join(setup.localBin, "docker") to record its invocations (e.g., append "$@" to a temp log file) and after calling setup.runDoctor("alpha doctor --json") assert that the log file does not contain the word "inspect"; locate the docker stub creation in the test (the fs.writeFileSync that writes the docker script) and the test invocation using setup.runDoctor and r to add the logging behavior and the assertion against the recorded invocations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/cli.test.ts`:
- Around line 2755-2788: Add an assertion that the fake docker binary was never
invoked with the "inspect" subcommand by modifying the stub created at
path.join(setup.localBin, "docker") to record its invocations (e.g., append "$@"
to a temp log file) and after calling setup.runDoctor("alpha doctor --json")
assert that the log file does not contain the word "inspect"; locate the docker
stub creation in the test (the fs.writeFileSync that writes the docker script)
and the test invocation using setup.runDoctor and r to add the logging behavior
and the assertion against the recorded invocations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f13549ec-373c-43cd-b13f-7db867bb20a9
📒 Files selected for processing (2)
src/lib/actions/sandbox/doctor.tstest/cli.test.ts
Address CodeRabbit nitpick on NVIDIA#4646: the Docker-driver doctor regression asserted the user-visible outcome but not the core contract. Record the fake docker argv and assert `docker inspect` is never invoked, so the test fails if doctor ever inspects the legacy k3s container and merely ignores the failure. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
prekshivyas
left a comment
There was a problem hiding this comment.
APPROVE.
Correctly gates the legacy openshell-cluster-<name> k3s container inspect on the gateway driver — skip for docker/vm, keep for kubernetes, platform-fallback for pre-openshellDriver entries — matching the driver taxonomy already used in gateway-failure-classifier.ts. The Docker-driver regression test asserts docker inspect is never called (not merely that its failure is tolerated), so it fails on pre-fix code. CI green on 9962844, mergeable, zero open threads. Resolves #4502.
Non-blocking nits:
src/lib/actions/dns/index.ts:193,275still filterdocker pson the same staleopenshell-clustername this PR removes from doctor — likely the same latent issue in another path; worth a follow-up.- The
vmand null/platform-fallback branches ofshouldInspectLegacyGatewayContainerhave no test.
Signed-off-by: Prekshi Vyas prekshiv@nvidia.com
## Summary - Add the missing `v0.0.57` release-notes section with links to the detailed docs pages for command, inference, onboarding, messaging, status, installer, and policy changes. - Remove public references to docs-skip terms from source docs and regenerate the NemoClaw user skills from the current Fern MDX docs. - Carry forward generated references for the per-agent documentation split, including Hermes-specific reference files. ## Source summary - #4615 and #4653 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover host-side `sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON` secondary-agent baking. - #4163, #4204, #4611, #4619, and #4676 -> `docs/about/release-notes.mdx`, `docs/inference/use-local-inference.mdx`: Release notes now cover managed vLLM progress/readiness, DGX Spark model default changes, local Ollama streaming usage, and inference route divergence warnings. - #4267, #4601, #4609, #4642, #4645, and #4661 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover UFW auto-remediation, local-inference reachability gates, gateway reuse/binding, cancel rollback, and policy selection persistence. - #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and Slack placeholder normalization. - #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover status failure layers, paused-container hints, Docker-driver doctor behavior, and non-destructive stale-registry recovery. - #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/network-policy/integration-policy-examples.mdx`: Release notes now cover installer tag pinning, PyPI `uv` policy access, and observable Jira validation. - #4632 -> `.agents/skills/`: Regenerated user skills from the current per-agent docs source, including newly generated Hermes reference files. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" docs --glob "*.mdx"` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills --glob "*.md"` - `npm run docs` - `npm run build:cli` - Commit hooks: markdownlint, docs-to-skills verification, gitleaks, skills YAML, commitlint <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Restructured documentation to clearly distinguish OpenClaw and Hermes agent variants throughout user guides. * Enhanced security, credential storage, and deployment guidance with clearer setup flows. * Added Hermes plugin installation and ecosystem documentation. * Improved workspace, messaging, and policy management references with variant-specific command examples. * Refined troubleshooting and CLI reference sections for clarity. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
nemoclaw doctoralways reported a[fail]Gateway check in Docker-driver mode because it unconditionally inspected the legacy k3s containeropenshell-cluster-nemoclaw, which only exists for the Kubernetes gateway driver. This makes doctor gate the legacy inspect on the gateway driver so healthy Docker-driver installs are no longer falsely marked unhealthy.Related Issue
Fixes #4502
Changes
openshell-cluster-<name>container inspect on the gateway driver insrc/lib/actions/sandbox/doctor.ts: skip it fordocker/vmdrivers, run it forkubernetes, and fall back to platform detection (isLinuxDockerDriverGatewayEnabled()) for older registry entries that predate theopenshellDriverfield.nemoclaw-openshell-gatewaycompat container), which already runs immediately afterward.openshell-cluster-nemoclaw.test/cli.test.ts: Docker-driver mode (legacy container absent but gateway healthy → no false failure, noopenshell-clustermention) and kubernetes driver (legacy container still inspected).Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Notes on verification:
node ./bin/nemoclaw.js <sandbox> doctor --json) on a live Linux Docker-driver host: before the fix the Gateway group showed[fail] Docker container: openshell-cluster-nemoclaw not foundwhileOpenShell statusreportedconnected to nemoclaw; after the fix the stale check is gone and the Gateway group relies on the healthy OpenShell status check.npm run typecheck:cliandnpx biome lint src/lib/actions/sandbox/doctor.tspass. Targeted doctor tests pass (7/7). Remainingnpm testfailures on the build host are pre-existing/environmental (load-induced 5000ms CLI-spawn timeouts and a umask-0027config-syncmode-bits assertion that fails identically on the clean base) and are unrelated to this change.Signed-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit