
fix(cli): skip stale k3s gateway container check in Docker-driver doctor (#4502) #4646

Merged
cv merged 3 commits into NVIDIA:main from yimoj:fix/4502-doctor-docker-gateway-name
Jun 2, 2026

Conversation

@yimoj yimoj commented Jun 2, 2026

Summary

nemoclaw doctor always reported a [fail] Gateway check in Docker-driver mode because it unconditionally inspected the legacy k3s container openshell-cluster-nemoclaw, which only exists for the Kubernetes gateway driver. This PR gates the legacy inspect on the gateway driver, so healthy Docker-driver installs are no longer falsely marked unhealthy.

Related Issue

Fixes #4502

Changes

  • Gate the legacy openshell-cluster-<name> container inspect on the gateway driver in src/lib/actions/sandbox/doctor.ts: skip it for docker/vm drivers, run it for kubernetes, and fall back to platform detection (isLinuxDockerDriverGatewayEnabled()) for older registry entries that predate the openshellDriver field.
  • Rely on the existing authoritative OpenShell status check for the Docker-driver gateway (host process or nemoclaw-openshell-gateway compat container), which already runs immediately afterward.
  • De-hardcode the "not found" hint to reference the actually-inspected container name instead of the literal openshell-cluster-nemoclaw.
  • Add regression tests in test/cli.test.ts: Docker-driver mode (legacy container absent but gateway healthy → no false failure, no openshell-cluster mention) and kubernetes driver (legacy container still inspected).
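
The gating in the list above can be sketched roughly as follows. This is a minimal sketch, not the code in src/lib/actions/sandbox/doctor.ts: only the helper name `shouldInspectLegacyGatewayContainer`, the `openshellDriver` field, and the docker/vm/kubernetes behavior come from this PR. The `SandboxEntry` shape, the `GatewayDriver` union, passing the platform probe in as a parameter, and the direction of the fallback (inspect only when the Docker-driver gateway is not enabled) are assumptions made for illustration.

```typescript
// Hypothetical types: the real registry entry likely has more fields.
type GatewayDriver = "docker" | "vm" | "kubernetes";

interface SandboxEntry {
  name: string;
  // Absent or null on older registry entries that predate the field.
  openshellDriver?: GatewayDriver | null;
}

// Decide whether doctor should probe the legacy openshell-cluster-<name>
// container. The platform probe is injected here so the sketch is
// self-contained; in the PR this role is played by
// isLinuxDockerDriverGatewayEnabled().
function shouldInspectLegacyGatewayContainer(
  sb: SandboxEntry,
  dockerDriverGatewayEnabled: () => boolean,
): boolean {
  switch (sb.openshellDriver) {
    case "kubernetes":
      // The k3s container only exists for the Kubernetes gateway driver.
      return true;
    case "docker":
    case "vm":
      // No legacy container to find; the OpenShell status check covers these.
      return false;
    default:
      // Assumed fallback for entries without a recorded driver: inspect only
      // when the platform is not running the Docker-driver gateway.
      return !dockerDriverGatewayEnabled();
  }
}
```

Injecting the probe keeps every branch testable without touching the host platform, which is one way the vm and fallback paths could be covered.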

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Notes on verification:

  • Reproduced end-to-end with this worktree's CLI (node ./bin/nemoclaw.js <sandbox> doctor --json) on a live Linux Docker-driver host: before the fix the Gateway group showed [fail] Docker container: openshell-cluster-nemoclaw not found while OpenShell status reported connected to nemoclaw; after the fix the stale check is gone and the Gateway group relies on the healthy OpenShell status check.
  • npm run typecheck:cli and npx biome lint src/lib/actions/sandbox/doctor.ts pass. Targeted doctor tests pass (7/7). Remaining npm test failures on the build host are pre-existing/environmental (load-induced 5000ms CLI-spawn timeouts and a umask-0027 config-sync mode-bits assertion that fails identically on the clean base) and are unrelated to this change.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

Summary by CodeRabbit

  • Bug Fixes
    • Prevented false "legacy gateway" failures in the sandbox diagnostic tool by only checking legacy containers when the gateway driver requires it; diagnostics now show the actual container name when present.
  • Tests
    • Added CLI tests to confirm correct diagnostic behavior for Docker vs Kubernetes gateway scenarios.

`nemoclaw doctor` unconditionally inspected the legacy k3s gateway
container `openshell-cluster-nemoclaw`, which only exists for the
Kubernetes gateway driver. On the current Linux/arm64 Docker-driver
gateway (host process or `nemoclaw-openshell-gateway` compat container)
that inspect always fails, so doctor reported a false `[fail]` even when
OpenShell status said `connected to nemoclaw` (NVIDIA#4502).

Gate the legacy container inspect on the gateway driver: skip it for
docker/vm drivers, run it for kubernetes, and fall back to platform
detection for older registry entries that predate `openshellDriver`.
The authoritative OpenShell status check already covers the
Docker-driver gateway. Also de-hardcode the not-found hint to the
inspected container name.

Add regression tests covering Docker-driver mode (legacy container
absent but gateway healthy -> no false failure) and the kubernetes
driver (legacy container still inspected).

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

coderabbitai Bot commented Jun 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9841b6cb-2cdd-46f6-9a93-65ec4a98806e

📥 Commits

Reviewing files that changed from the base of the PR and between 9962844 and fddda90.

📒 Files selected for processing (1)
  • test/cli.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/cli.test.ts

Walkthrough

Doctor now only inspects legacy openshell-cluster-* Docker containers when the sandbox driver is Kubernetes or when platform-level fallback indicates the legacy Docker gateway driver; the inspection call is guarded by a new helper and tests plus an improved error hint were added.

Changes

Legacy Gateway Container Inspection Fix

| Layer / File(s) | Summary |
| --- | --- |
| **Conditional legacy gateway inspection**<br>`src/lib/actions/sandbox/doctor.ts` | Adds import for `isLinuxDockerDriverGatewayEnabled`, introduces `shouldInspectLegacyGatewayContainer(sb)` helper to decide whether to probe the legacy `openshell-cluster-<name>` container based on `sb.openshellDriver` with a platform fallback, and gates the `dockerInspectGateway()` call in `runSandboxDoctor`. |
| **Error message generalization and driver-mode tests**<br>`src/lib/actions/sandbox/doctor.ts`, `test/cli.test.ts` | Docker inspection failure hint now references the actual inspected `containerName`. Two Vitest cases assert that Docker-mode sandboxes skip the legacy container inspection (no "Docker container" check, no `openshell-cluster` strings) and Kubernetes-mode sandboxes still inspect it (check present and references `openshell-cluster-nemoclaw`). |
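
To illustrate the de-hardcoded hint described above, here is a minimal sketch. The `openshell-cluster-<name>` pattern and the "Docker container" check label appear in this PR; the function names, the `DoctorCheck` shape, and the exact hint wording are hypothetical and are not taken from doctor.ts.

```typescript
// Hypothetical shape of one doctor check result.
interface DoctorCheck {
  label: string;
  status: "ok" | "fail";
  detail: string;
}

// Build the legacy k3s container name from the gateway name.
function legacyGatewayContainerName(gatewayName: string): string {
  return `openshell-cluster-${gatewayName}`;
}

// Report a missing container using the name that was actually inspected,
// rather than a literal such as openshell-cluster-nemoclaw.
function containerNotFoundCheck(containerName: string): DoctorCheck {
  return {
    label: "Docker container",
    status: "fail",
    detail: `${containerName} not found`,
  };
}
```

With this shape, a gateway named `staging` would produce a message that names `openshell-cluster-staging`, so the hint stays accurate for non-default gateway names.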

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4608: Both PRs add driver-aware gating to avoid inspecting the legacy openshell-cluster-* k3s gateway container on docker/vm sandboxes.

Suggested labels

Docker, fix, NemoClaw CLI, OpenShell, Sandbox

Suggested reviewers

  • cv
  • cjagwani

Poem

🐰 I nibble code where gates reside,
I peek for containers, but only when I should.
Docker naps undisturbed, Kubernetes checked with pride,
No phantom fails — the burrow's quiet and good.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: fixing a false Gateway check failure by skipping the legacy k3s container check when using Docker driver. |
| Linked Issues check | ✅ Passed | The PR implements the fix requested in #4502: conditional skipping of stale k3s container lookup for Docker-driver environments and proper container name handling. |
| Out of Scope Changes check | ✅ Passed | All changes directly address the scope of #4502: doctor gateway check fix, container inspection logic, and regression test coverage. |


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
test/cli.test.ts (1)

2755-2788: ⚡ Quick win

Assert that docker inspect was never invoked.

This test proves the user-visible outcome, but not the core contract from the fix. If doctor still called docker inspect and then ignored that failure, these assertions would still pass. Recording fake Docker argv here and asserting no inspect call occurs would make the regression much tighter.

Suggested test hardening
+    const dockerCalls = path.join(setup.home, "docker-calls");
     fs.writeFileSync(
       path.join(setup.localBin, "docker"),
       [
         "#!/usr/bin/env bash",
+        `printf '%s\\n' "$*" >> ${JSON.stringify(dockerCalls)}`,
         'if [ "$1" = "info" ]; then echo "24.0.0"; exit 0; fi',
         'if [ "$1" = "inspect" ]; then echo "Error: No such object: $3" >&2; exit 1; fi',
         "exit 0",
       ].join("\n"),
       { mode: 0o755 },
@@
     expect(report.checks.find((check) => check.label === "Docker container")).toBeUndefined();
+    expect(fs.readFileSync(dockerCalls, "utf8")).not.toMatch(/\binspect\b/);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/cli.test.ts` around lines 2755 - 2788, Add an assertion that the fake
docker binary was never invoked with the "inspect" subcommand by modifying the
stub created at path.join(setup.localBin, "docker") to record its invocations
(e.g., append "$@" to a temp log file) and after calling setup.runDoctor("alpha
doctor --json") assert that the log file does not contain the word "inspect";
locate the docker stub creation in the test (the fs.writeFileSync that writes
the docker script) and the test invocation using setup.runDoctor and r to add
the logging behavior and the assertion against the recorded invocations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f13549ec-373c-43cd-b13f-7db867bb20a9

📥 Commits

Reviewing files that changed from the base of the PR and between cab8f39 and 3b49be6.

📒 Files selected for processing (2)
  • src/lib/actions/sandbox/doctor.ts
  • test/cli.test.ts

Address CodeRabbit nitpick on NVIDIA#4646: the Docker-driver doctor regression
asserted the user-visible outcome but not the core contract. Record the
fake docker argv and assert `docker inspect` is never invoked, so the
test fails if doctor ever inspects the legacy k3s container and merely
ignores the failure.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj added the v0.0.57 Release target label Jun 2, 2026

@prekshivyas prekshivyas left a comment

APPROVE.

Correctly gates the legacy openshell-cluster-<name> k3s container inspect on the gateway driver — skip for docker/vm, keep for kubernetes, platform-fallback for pre-openshellDriver entries — matching the driver taxonomy already used in gateway-failure-classifier.ts. The Docker-driver regression test asserts docker inspect is never called (not merely that its failure is tolerated), so it fails on pre-fix code. CI green on 9962844, mergeable, zero open threads. Resolves #4502.

Non-blocking nits:

  • src/lib/actions/dns/index.ts:193,275 still filter docker ps on the same stale openshell-cluster name this PR removes from doctor — likely the same latent issue in another path; worth a follow-up.
  • The vm and null/platform-fallback branches of shouldInspectLegacyGatewayContainer have no test.
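
One way the untested vm and platform-fallback branches could be covered is a small table-driven check. This is a hedged sketch only: the `Gate` signature, the case table, and `failingGateCases` are hypothetical, and the expected fallback outcomes assume that doctor inspects the legacy container only when the platform is not running the Docker-driver gateway.

```typescript
type Driver = "docker" | "vm" | "kubernetes" | null;

// Hypothetical signature: a gate takes the recorded driver and whether the
// platform probe reports the Docker-driver gateway as enabled.
type Gate = (driver: Driver, dockerDriverPlatform: boolean) => boolean;

interface GateCase {
  label: string;
  driver: Driver;
  dockerDriverPlatform: boolean;
  expectInspect: boolean;
}

// Cases for the branches the review calls out as untested.
const gateCases: GateCase[] = [
  { label: "vm driver skips inspect", driver: "vm", dockerDriverPlatform: false, expectInspect: false },
  { label: "no driver on Docker-driver platform skips inspect", driver: null, dockerDriverPlatform: true, expectInspect: false },
  { label: "no driver on legacy platform inspects", driver: null, dockerDriverPlatform: false, expectInspect: true },
];

// Return the labels of every case the given gate gets wrong.
function failingGateCases(gate: Gate): string[] {
  return gateCases
    .filter((c) => gate(c.driver, c.dockerDriverPlatform) !== c.expectInspect)
    .map((c) => c.label);
}
```

In a Vitest suite each entry could become its own test case, with the real helper adapted to the `Gate` signature.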

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

@prekshivyas prekshivyas self-assigned this Jun 2, 2026
@cv cv merged commit c32e797 into NVIDIA:main Jun 2, 2026
30 checks passed
cv pushed a commit that referenced this pull request Jun 3, 2026
## Summary
- Add the missing `v0.0.57` release-notes section with links to the
detailed docs pages for command, inference, onboarding, messaging,
status, installer, and policy changes.
- Remove public references to docs-skip terms from source docs and
regenerate the NemoClaw user skills from the current Fern MDX docs.
- Carry forward generated references for the per-agent documentation
split, including Hermes-specific reference files.

## Source summary
- #4615 and #4653 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover host-side
`sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON`
secondary-agent baking.
- #4163, #4204, #4611, #4619, and #4676 ->
`docs/about/release-notes.mdx`,
`docs/inference/use-local-inference.mdx`: Release notes now cover
managed vLLM progress/readiness, DGX Spark model default changes, local
Ollama streaming usage, and inference route divergence warnings.
- #4267, #4601, #4609, #4642, #4645, and #4661 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release
notes now cover UFW auto-remediation, local-inference reachability
gates, gateway reuse/binding, cancel rollback, and policy selection
persistence.
- #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover
Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and
Slack placeholder normalization.
- #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover status failure
layers, paused-container hints, Docker-driver doctor behavior, and
non-destructive stale-registry recovery.
- #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/lifecycle.mdx`,
`docs/network-policy/integration-policy-examples.mdx`: Release notes now
cover installer tag pinning, PyPI `uv` policy access, and observable
Jira validation.
- #4632 -> `.agents/skills/`: Regenerated user skills from the current
per-agent docs source, including newly generated Hermes reference files.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" docs --glob "*.mdx"`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" .agents/skills --glob "*.md"`
- `npm run docs`
- `npm run build:cli`
- Commit hooks: markdownlint, docs-to-skills verification, gitleaks,
skills YAML, commitlint

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
  * Restructured documentation to clearly distinguish OpenClaw and Hermes agent variants throughout user guides.
  * Enhanced security, credential storage, and deployment guidance with clearer setup flows.
  * Added Hermes plugin installation and ecosystem documentation.
  * Improved workspace, messaging, and policy management references with variant-specific command examples.
  * Refined troubleshooting and CLI reference sections for clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Labels

v0.0.57 Release target

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][CLI&UX] nemoclaw doctor Gateway check uses k3s container name, always [fail] in Docker mode

4 participants