Skip to content

fix(doctor): handle docker-driver gateway mode (resolver + skip k3s port check)#3941

Open
luisignaciomaiz-cmyk wants to merge 2 commits into
NVIDIA:mainfrom
luisignaciomaiz-cmyk:fix/doctor-docker-driver-pr
Open

fix(doctor): handle docker-driver gateway mode (resolver + skip k3s port check)#3941
luisignaciomaiz-cmyk wants to merge 2 commits into
NVIDIA:mainfrom
luisignaciomaiz-cmyk:fix/doctor-docker-driver-pr

Conversation

@luisignaciomaiz-cmyk
Copy link
Copy Markdown

@luisignaciomaiz-cmyk luisignaciomaiz-cmyk commented May 20, 2026

Summary

nemoclaw <sandbox> doctor reports two false failures in docker-driver gateway mode (openshellDriver === "docker"):

  1. Gateway > Docker container: faildockerInspectGateway looks for a hard-coded container name openshell-cluster-${NEMOCLAW_GATEWAY_NAME} that only exists in kubernetes-driver mode.
  2. Gateway > Port mapping: fail — once the inspect succeeds, this check looks for 30051/tcp published on the host port. In docker-driver mode the gateway is a HOST process (not inside the sandbox container), so no container port mapping exists.

Both failures are diagnostic-only — the gateway itself works correctly in docker-driver mode, and OpenShell status: connected reports ok alongside the failed container check. The user is told the system is broken when it isn't.

Root cause

dockerInspectGateway in src/lib/actions/sandbox/doctor.ts was written for the legacy k3s-style cluster gateway architecture:

  • The gateway runs inside a Docker container named openshell-cluster-<gateway-name>.
  • That container exposes gRPC on port :30051.
  • The host publishes that port at NEMOCLAW_GATEWAY_PORT.

In docker-driver mode (openshellDriver === "docker"), the architecture is different:

  • Each sandbox has its own container named openshell-${sandbox-name}-${uuid}.
  • The gateway runs as a host process (not inside any container).
  • The sandbox container talks to the host gateway via host.openshell.internal.

src/lib/shields/index.ts already has a resolveDockerDriverSandboxContainer(sandboxName) helper that handles this correctly — it returns null unless openshellDriver === "docker", otherwise enumerates containers via docker ps and matches the openshell-${sandbox-name}- prefix. It's used internally for kubectlExecArgv vs. docker-driver routing in shields code. But it wasn't exported, so doctor.ts couldn't use it.

Fix

Two surgical changes (2 files, ~7 insertions, 2 deletions):

src/lib/shields/index.ts — add resolveDockerDriverSandboxContainer to the named-export block at line ~1260. No function body changes; just exposing the existing implementation.

src/lib/actions/sandbox/doctor.ts:

  • dockerInspectGateway(containerName: string)dockerInspectGateway(containerName: string, skipPortCheck: boolean = false). The new optional flag defaults to false, preserving all existing callers' behavior.
  • Inside the function, add if (skipPortCheck) return checks; before the port-mapping block. When set, the function emits the "Docker container" check and returns, skipping the k3s-only port check.
  • The single caller (line 448) computes the resolver result and passes the resolved name (or legacy fallback) plus the skip flag:
ts
const _dockerDriverContainer = shields.resolveDockerDriverSandboxContainer(sandboxName);
checks.push(...dockerInspectGateway(
  _dockerDriverContainer || `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`,
  !!_dockerDriverContainer
));
  • dockerInspectGateway is now exported (annotated /** @internal — exported for unit tests; sole production consumer is runSandboxDoctor in this file. */) so the new test file can import it directly.

Tests

Added src/lib/actions/sandbox/doctor.test.ts with two unit tests covering both branches of dockerInspectGateway:

  • skipPortCheck=true (docker-driver path) — asserts only one host command issued (docker inspect), no port probe; result contains the "Docker container" check, no "Port mapping" check.
  • skipPortCheck=false (default, K3s path) — asserts both inspect + port probes run as before; result contains both checks, port mapping reports ok when host mapping contains :GATEWAY_PORT.

Verified locally on Ubuntu 24.04 aarch64 (DGX Spark) with npx vitest run src/lib/actions/sandbox/ — 37/37 tests pass across 7 files in 4ms.

Note for reviewers about the test import path: the test imports dockerInspectGateway from ../../../../dist/lib/actions/sandbox/doctor (compiled output) rather than ./doctor (source). This follows the existing precedent in src/lib/actions/sandbox/status.test.ts. The reason: doctor.ts's transitive deps use bare CJS require("./relative") for sibling modules, which vitest cannot resolve to .ts on the fly without first running npm run build:cli. A long-term refactor converting these require() calls to ES imports would unblock direct .ts testing project-wide — happy to do that in a follow-up PR if desired.

Two other diligence notes from the test work:

  1. npm install --ignore-scripts is required to get vitest installed because the package's prepare script runs npm install --omit=dev after the build, which would strip vitest right after installing it. The --ignore-scripts flag avoids that. Worth a contributor-docs note.
  2. Mocking child_process.spawnSync for dist/-imported code required the createRequire(import.meta.url)("node:child_process") pattern. vi.spyOn against the ESM namespace fails with "Cannot redefine property: spawnSync" (the namespace is frozen). The CJS-cached module is mutable. Documented inline in the test for future contributors.

Driver-mode matrix (verified behavior)

openshellDriver | Container check | Port check -- | -- | -- "docker" | inspects real sandbox container, reports running/health/image | skipped (gateway is on host, not in container) "kubernetes" | inspects openshell-cluster- (unchanged) | runs as before (looks for 30051/tcp) "vm" (macOS) | resolver returns null, legacy fallback, inspect fails as before | runs as before, fails as before null / undefined / registry throws | resolver's try/catch returns null, legacy fallback | runs as before

All non-docker modes preserve current behavior exactly. Docker-driver mode now gets accurate diagnostics.

Reproduction (before this PR)

On Ubuntu 24.04 aarch64 (DGX Spark) with NemoClaw 2026.4.24, openshellDriver === "docker", real sandbox container openshell-jarvis-19f22332-…:

$ nemoclaw <sandbox> doctor
Gateway:
  [fail] Docker container: openshell-cluster-nemoclaw not found or not inspectable
       hint: run `docker ps --filter name=openshell-cluster-nemoclaw`
  [ok] OpenShell status: connected to nemoclaw
Summary: attention needed (1 failed, 1 warning(s))

After this PR:

$ nemoclaw <sandbox> doctor
Gateway:
  [ok] Docker container: openshell-jarvis-19f22332-… running (none; openshell/sandbox-from:1779075943)
  [ok] OpenShell status: connected to nemoclaw
Summary: healthy with 1 warning(s)

The remaining warning is Shields: down, an environmental finding unrelated to this PR.

Also verified end-to-end post-patch in the same environment:

  • Gateway gRPC/MCP traffic still works (sandbox → host MCP server → Gmail / iCloud Calendar APIs; confirmed via real bilingual tool-call exchange).
  • nemoclaw <sandbox> status output unchanged.
  • Local automated audit suite (independent of doctor) remains green.
  • No shields-related regressions.

Blast radius / risk

Minimal:

  • shields/index.ts change is purely additive (one new line in the named-export block; no function body changes). Cannot affect any existing call paths.
  • doctor.ts change adds an optional parameter (default preserves all existing callers' behavior) and a single early-return guard. The only call site that passes the new flag is the one updated. The function is annotated @internal — exported for unit tests so the export keyword shouldn't be relied on by external consumers.
  • No new dependencies, no signature changes to publicly-exported functions, no shared-state changes.
  • Tests are additive (new file, new export). Existing tests untouched.

Related findings (not in this PR)

While auditing, I also noticed stopSandboxChannelsViaKubectl in src/lib/tunnel/services.ts:439 uses the same hard-coded GATEWAY_CLUSTER_CONTAINER constant. However, this is safe and doesn't need a patch — it's wrapped in a try-then-fallback pattern: stopSandboxChannels (line ~369) tries the kubectl-via-cluster-container path first, checks the result via reportStopResult, and falls through to openshell sandbox exec with the GATEWAY_STOP_SCRIPT if the kubectl call fails or returns false. In docker-driver mode, the kubectl call fails gracefully (container doesn't exist), and the fallback runs the verified bash script inside the actual sandbox to stop the gateway. The doctor's failure was the anomaly precisely because doctor had no equivalent fallback; the gateway-stop architecture already self-protects. If a maintainer would prefer this PR also include a defensive guard inside stopSandboxChannelsViaKubectl itself (return null early when openshellDriver !== "kubernetes"), I'm happy to add it — let me know.

Also: I observed that OpenClaw's MCP runtime adapter (openclaw/openclaw repo, separate from this) silently drops the headers field on remote MCP server configs when the transport field is not set explicitly (even though sse is the documented default). This is the SSE analogue of openclaw/openclaw#65590 (streamable-http). Will file a separate issue against openclaw/openclaw referencing #65590 — wrong repo for this PR.

Local commit history on this branch

  • 6c9201062 — fix(doctor): handle docker-driver gateway mode (resolver + skip k3s port check)
  • 870844ef5 — test(doctor): cover docker-driver skipPortCheck branch of dockerInspectGateway

2 commits, 2 files net, ~123 insertions / 2 deletions. Ready to squash-merge if you prefer a single-commit history.

Summary by CodeRabbit

  • Tests

    • Added comprehensive test suite for Docker gateway inspection diagnostics.
  • Refactor

    • Improved sandbox diagnostics to better handle docker-driver containers by optimizing Docker port-mapping checks.

Review Change Stack

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

📝 Walkthrough

Walkthrough

This PR extends the sandbox doctor diagnostic logic to support docker-driver-specific containers by conditionally skipping port-mapping checks. The shields module now exports a container resolution helper, dockerInspectGateway gains an optional skipPortCheck parameter, and the doctor integrates both to adapt probe sequencing based on container type. Tests validate the updated behavior.

Changes

Docker-driver sandbox health checks

Layer / File(s) Summary
Export docker-driver container resolution helper
src/lib/shields/index.ts
The resolveDockerDriverSandboxContainer function is added to the module's public exports to enable sandbox doctor and test consumers to resolve docker-driver-specific gateway containers.
Add skipPortCheck parameter to dockerInspectGateway
src/lib/actions/sandbox/doctor.ts
The function is exported and extended with an optional skipPortCheck boolean parameter; when true, it returns the Docker inspect check without performing port-mapping inspection.
Integrate docker-driver container resolution into runSandboxDoctor
src/lib/actions/sandbox/doctor.ts
The sandbox doctor orchestration resolves whether a docker-driver gateway container exists and passes the appropriate skipPortCheck flag to dockerInspectGateway, adapting the probe sequence based on container type.
Test suite for dockerInspectGateway probe sequencing
src/lib/actions/sandbox/doctor.test.ts
A Vitest suite mocks spawnSync and validates that the function executes the correct Docker probe sequence (inspect-only vs. inspect+port) and produces the expected check structure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Docker, fix, Sandbox, NemoClaw CLI

Suggested reviewers

  • cv
  • ericksoa

Poem

A rabbit hops through Docker's gates,
Where driver containers hesitate—
Skip the ports when shields align,
Inspect the gateway, all is fine! 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: fixing docker-driver gateway mode handling by adding a resolver and skipping k3s port checks, which aligns with the core objectives and file modifications.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/actions/sandbox/doctor.ts`:
- Around line 469-470: The current code falls back to the k3s gateway name by
using `|| \`openshell-cluster-${NEMOCLAW_GATEWAY_NAME}\`` when
`shields.resolveDockerDriverSandboxContainer(sandboxName)` returns null,
producing misleading hints; change the logic so you only call
`dockerInspectGateway(...)` if `_dockerDriverContainer` is non-null (i.e., check
`if (_dockerDriverContainer) {
checks.push(...dockerInspectGateway(_dockerDriverContainer, true)) }`) and do
not pass the k3s fallback name when the resolver returned null—this prevents
false positives and keeps `dockerInspectGateway` behavior tied to the actual
resolved container.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 54471912-b4b4-4c96-86f8-8e0f7623ca89

📥 Commits

Reviewing files that changed from the base of the PR and between 139f2b3 and 870844e.

📒 Files selected for processing (3)
  • src/lib/actions/sandbox/doctor.test.ts
  • src/lib/actions/sandbox/doctor.ts
  • src/lib/shields/index.ts

Comment on lines +469 to +470
const _dockerDriverContainer = shields.resolveDockerDriverSandboxContainer(sandboxName);
checks.push(...dockerInspectGateway(_dockerDriverContainer || `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`, !!_dockerDriverContainer));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid falling back to k3s gateway name for docker-driver sandboxes when resolver returns null.

At Line 469 and Line 470, a docker-driver sandbox with a stopped/non-running container can resolve to null (resolver uses running containers), then incorrectly falls back to openshell-cluster-nemoclaw. That yields a misleading architecture-specific failure/hint.

💡 Suggested fix
-  const _dockerDriverContainer = shields.resolveDockerDriverSandboxContainer(sandboxName);
-  checks.push(...dockerInspectGateway(_dockerDriverContainer || `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`, !!_dockerDriverContainer));
+  const dockerDriverContainer = shields.resolveDockerDriverSandboxContainer(sandboxName);
+  const sandboxEntry = registry.getSandbox(sandboxName);
+  const isDockerDriver = sandboxEntry?.openshellDriver === "docker";
+  const gatewayTarget = isDockerDriver
+    ? (dockerDriverContainer || `openshell-${sandboxName}`)
+    : `openshell-cluster-${NEMOCLAW_GATEWAY_NAME}`;
+  checks.push(...dockerInspectGateway(gatewayTarget, isDockerDriver));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/actions/sandbox/doctor.ts` around lines 469 - 470, The current code
falls back to the k3s gateway name by using `||
\`openshell-cluster-${NEMOCLAW_GATEWAY_NAME}\`` when
`shields.resolveDockerDriverSandboxContainer(sandboxName)` returns null,
producing misleading hints; change the logic so you only call
`dockerInspectGateway(...)` if `_dockerDriverContainer` is non-null (i.e., check
`if (_dockerDriverContainer) {
checks.push(...dockerInspectGateway(_dockerDriverContainer, true)) }`) and do
not pass the k3s fallback name when the resolver returned null—this prevents
false positives and keeps `dockerInspectGateway` behavior tied to the actual
resolved container.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant