Skip to content

fix(onboard): auto-patch cluster image for Docker 26+ overlayfs nested mount break#2487

Merged
ericksoa merged 12 commits intomainfrom
fix/overlayfs-snapshotter-2481
Apr 26, 2026
Merged

fix(onboard): auto-patch cluster image for Docker 26+ overlayfs nested mount break#2487
ericksoa merged 12 commits intomainfrom
fix/overlayfs-snapshotter-2481

Conversation

@ericksoa
Copy link
Copy Markdown
Contributor

@ericksoa ericksoa commented Apr 26, 2026

Summary

Docker 26+ flips the install-time default storage driver from the legacy overlay2 graph driver to the new containerd image-store with overlayfs snapshotter. The k3s server inside the upstream OpenShell cluster image relies on the host overlayfs supporting nested overlay mounts, which the new snapshotter does not. Users hit a fatal k3s loop with "overlayfs" snapshotter cannot be enabled... try using "fuse-overlayfs" or "native" after a few minutes.

NemoClaw now detects the offending Docker configuration during onboard and transparently builds a small drop-in cluster image locally that installs fuse-overlayfs and selects it as the k3s snapshotter. No published image, no host configuration changes, no sudo, no Docker restart.

Related Issue

Refs #2481.

Changes

  • src/lib/preflight.tsparseDockerStorageDriver and parseDockerUsesContainerdSnapshotter parsers; new HostAssessment fields dockerStorageDriver, dockerUsesContainerdSnapshotter, hasNestedOverlayConflict. Conflict scoped to Linux + docker runtime + Driver: overlayfs + DriverStatus mentions io.containerd.snapshotter.v1. macOS Docker Desktop, legacy overlay2, and containerd-disabled Linux installs are not flagged.
  • src/lib/cluster-image-patch.ts (new) — ensurePatchedClusterImage builds a deterministic nemoclaw-cluster:<version>-<snapshotter>-<sha> image via a multi-stage Dockerfile. The fuse-overlayfs path uses a ubuntu:24.04 builder pinned by digest (UBUNTU_BUILDER_DIGEST constant) that apt-get installs the package and stages the binary plus its libfuse3.so.3 runtime dep under /export; the final stage COPY --from=bin-fetchers both into the cluster image and runs ldconfig. The native path skips the builder entirely and is just FROM ${UPSTREAM} + CMD ["server", "--snapshotter=native"]. Cache hit via docker image inspect. Three timeout knobs (pullTimeoutMs 10m, buildTimeoutMs 5m, inspectTimeoutMs 30s) protect onboard from a stuck daemon. Throws ClusterImagePatchError on failure with diagnostic context.
  • src/lib/onboard.tsapplyOverlayfsAutoFix consults assessHost, calls ensurePatchedClusterImage when the conflict is detected, and overrides OPENSHELL_CLUSTER_IMAGE for openshell gateway start. Two override env vars: NEMOCLAW_DISABLE_OVERLAY_FIX=1 skips the auto-fix; NEMOCLAW_OVERLAY_SNAPSHOTTER=native switches to the k3s native snapshotter (more disk, no FUSE dependency).
  • src/lib/preflight.test.ts — 11 new test cases (parser fixtures from the bug report, conflict + non-conflict assessments, macOS false-positive guard).
  • src/lib/cluster-image-patch.test.ts (new) — 19 test cases covering Dockerfile shape (multi-stage builder, no github.com link, no apt/curl in final stage, native variant skips fuse-overlayfs, ubuntu builder digest-pinned), version extraction (registry-port, registry-port-plus-tag, trailing-colon edge cases), tag determinism, cache-hit/cache-miss/build-failure/pull-failure with injectable runners and a mock fs surface.
  • test/e2e/test-overlayfs-autofix.sh (new) — temporary nightly E2E. Forces the GitHub-Actions runner's Docker onto containerd-snapshotter overlayfs via daemon.json, asserts the auto-fix triggers + the gateway runs on the patched image + the cluster log is clean (positive phase); calls ensurePatchedClusterImage directly to verify the cache-hit invariant (idempotency phase); runs install.sh with NEMOCLAW_DISABLE_OVERLAY_FIX=1 under a bounded timeout and asserts non-zero exit (negative phase). Per-run mktemp -d STATE_DIR for daemon backup state. Removal criteria documented in the script header.
  • .github/workflows/nightly-e2e.yaml — adds the temporary overlayfs-autofix-e2e job and entry in notify-on-failure: needs:. Comments call out removal criteria.
  • docs/reference/troubleshooting.md — new section explaining root cause, auto-fix, override knobs, and the manual daemon.json alternative. Reflowed to one sentence per source line.
  • docs/get-started/prerequisites.md — one-line note about transparent storage-driver handling. One sentence per source line.
  • .agents/skills/nemoclaw-user-* — auto-regenerated from docs by the existing pre-commit hook.

Type of Change

  • Code change with doc updates

Verification

  • npx prek run --all-files passes (with SKIP=test-cli to skip pre-existing flaky CLI vitest tests on this dev machine — same 6 tests fail on stock main HEAD verified by stashing locally; behavior covered by CI on a faster runner)
  • npm run typecheck:cli passes
  • npm run build:cli passes
  • cd nemoclaw && npx tsc --noEmit --incremental passes
  • npx tsc -p jsconfig.json passes
  • New tests added: 16 cases in preflight.test.ts, 22 cases in cluster-image-patch.test.ts — all pass
  • No secrets or credentials committed
  • Docs updated for user-facing behavior change
  • End-to-end validated on a real Docker 26+ host via the new nightly overlayfs-autofix-e2e job (run 24966948867): 15 PASS / 0 FAIL / 1 SKIP. The SKIP is documented (canonical k3s startup error string is absent on this runner kernel because the same root cause surfaces at sandbox-build time instead — non-zero install.sh exit still proves the auto-fix is load-bearing).

Notes for reviewers

Architectural choices

  • No published wrapper image. The patched image is built locally on the user's machine via docker build. No GHCR namespace, no signing pipeline, no SBOM, no security-boundary expansion to maintain.
  • Multi-stage Dockerfile, builder via apt rather than upstream URL. The cluster image's noble base ships BusyBox tar (so apt-get install directly into it fails with a dpkg-deb extraction error) AND lacks curl (so an in-place static-binary download via RUN curl … exits 127). Both attempts were tried during iteration and replaced. The current shape uses a clean ubuntu:24.04 builder stage (digest-pinned for reproducibility) that runs apt-get install fuse-overlayfs, then COPY --froms the binary plus its sole non-libc shared dependency (libfuse3.so.3) into the cluster image.
  • native snapshotter path is intentionally minimal. buildPatchDockerfile("native") returns a four-line Dockerfile (FROM ${UPSTREAM} + CMD) — no builder, no apt, no fuse-overlayfs install. Selecting NEMOCLAW_OVERLAY_SNAPSHOTTER=native should cost the user nothing they don't need.
  • Three timeouts on the build path. pullTimeoutMs (10 min), buildTimeoutMs (5 min), and inspectTimeoutMs (30 s) protect the entire patched-image flow — including the docker image inspect cache probe — from a stuck Docker daemon.
  • Hermetic env in the e2e. Phase-3 (positive) and phase-4 (idempotency) env invocations explicitly env -u NEMOCLAW_DISABLE_OVERLAY_FIX -u NEMOCLAW_OVERLAY_SNAPSHOTTER so a caller's environment can't silently flip the path under test.
  • Per-run mktemp -d STATE_DIR for daemon backup. The previous /tmp/nemoclaw-e2e-daemon.json.{bak,absent} paths were predictable enough that a pre-created symlink could redirect the script's sudo cp of /etc/docker/daemon.json into an attacker-chosen file on a shared runner. Both files now live inside a 0700 mktemp directory, wiped by the EXIT trap.

Adversarial review findings (resolved in e0d8df15)

Spawned a fresh-context adversarial reviewer to look for what CodeRabbit and I missed. Eight concerns surfaced; all addressed.

# Severity Concern Status
1 Major getGatewayStartEnv() is called from both startGatewayWithOptions and recoverGatewayRuntime. The retry-storm impact was overstated by the reviewer (the retry closure captures gatewayEnv from outer scope; build runs once per call, not once per retry), but the recovery path does redundantly re-run assessHost. Resolved. applyOverlayfsAutoFix now memoizes per upstream image; the recovery path returns the cached patched-tag without re-running assessment or attempting another build.
2 Major computePatchedTag SHAs the Dockerfile text, decoupled from the upstream image's actual content digest. If GHCR re-pushes ghcr.io/nvidia/openshell/cluster:<v> with a security fix, imageExists reports cache-hit and the user keeps running the patched image with stale upstream layers. Resolved. Cache key now includes the resolved upstream content digest. ensurePatchedClusterImage calls docker image inspect first (zero network cost when warm; works air-gapped with pre-staged images) and folds the digest into the SHA. A re-pushed upstream produces a different digest → different patched tag → next onboard rebuilds. New unit test asserts digest changes invalidate the tag.
3 Major First-time onboard on an air-gapped/restricted host hits a 10 min docker pull timeout + 5 min docker build timeout before any feedback. Resolved. When the upstream image isn't already local, a docker manifest inspect reachability probe (bounded by inspectTimeoutMs, default 30s) runs before committing to the long pull. Failure surfaces as a typed error with a pointer to the manual daemon.json workaround in troubleshooting docs. New unit test: "fails fast with a documented error when upstream is unreachable on a cache miss".
4 Minor assessment.isWsl is computed and ignored. Native Docker inside WSL2 (no Desktop integration) would currently trigger the auto-fix, but the WSL kernel may not actually have the nested-overlay break. Resolved. hasNestedOverlayConflict now requires !isWsl. Conservative — leave WSL alone until we have a confirmed reproducer. New unit test: "does not flag a WSL2 Linux host as a conflict".
5 Minor applyOverlayfsAutoFix silently swallows assessHost exceptions — no log breadcrumb if a future regression makes the auto-fix mysteriously stop firing. Resolved. Added console.warn with the error message before returning null.
6 Minor Phase-5 negative e2e test accepts "any non-zero install.sh exit" as a pass; flakes (NVIDIA_API_KEY rejection, GHCR rate-limit, etc.) look identical to "auto-fix is load-bearing". Resolved. Negative phase now grep's the cluster log + install log for at least one of three nested-overlay-failure signatures (overlayfs snapshotter cannot be enabled, CreateDiff: Canceled, failed to mount overlay). FAIL if none present — previously an unrelated install failure would have papered over as a SKIP.
7 Minor Manual daemon.json workaround in docs warns about killing containers but omits that the image graph rebuilds on the new storage driver — previously-cached images become unusable. Resolved. Added a sentence explaining the cold-cache cost of switching drivers.
8 Nit parseDockerStorageDriver only matches the JSON form; future plain-text callers would silently miss the conflict. Resolved. Added a fallback regex for the plain-text Storage Driver: <name> form. New unit test for the plain-text fixture.

The reviewer also explicitly verified — and could not break — multi-arch correctness, registry-port edge cases, tag determinism, runCapture timeout threading, e2e symlink-redirect protection, and the set -e discipline. Those guards hold.

Strategic context

  • The auto-fix is a downstream bridge until OpenShell's roadmap migration off k3s (NVIDIA/OpenShell#873) lands. OpenShell explicitly closed #756 (the same nested-overlay class of bug on OrbStack) wontfix in favor of that migration.
  • The temporary nightly job (overlayfs-autofix-e2e) and the cluster-image-patch.ts module should be removed in the same PR that retires this code path once OpenShell ships a non-k3s default sandbox driver. Removal criteria are documented in both the script header and the workflow job comment.

What the nightly e2e revealed and how it shaped the final shape

Iteration history (commits visible in the branch log):

Attempt Result on the GitHub Actions runner Lesson
RUN apt-get install fuse-overlayfs directly into the cluster image dpkg-deb failed: BusyBox tar doesn't accept --warning=no-timestamp Cluster image's noble base lacks GNU tar
RUN curl -fsSL …upstream-release… && chmod +x … exit 127 Cluster image lacks curl too
Multi-stage FROM alpine:3.20 + apk add curl + curl …upstream-release… Hit "No External Project Links" policy violation (CONTRIBUTING.md) GitHub URLs in source are policy-prohibited
Multi-stage FROM ubuntu:24.04 + apt-get install fuse-overlayfs + COPY binary + libfuse3.so.3 Working. Final shape.

Each iteration was caught by the temporary nightly e2e before reviewer time was spent on it. The same e2e also caught a phase-4 design flaw where re-running install.sh was racing the runner kernel's nested-overlay limits — restructured to call ensurePatchedClusterImage directly from Node instead.

Standing items reviewers should know

  • Pre-existing local flakes. test-cli (Vitest CLI project) has 6 timeout-prone tests in test/onboard.test.ts and one legacy-path-guard test that fail on stock main HEAD on this dev machine. I verified by stashing locally inside the same fresh clone and re-running. They were skipped via SKIP=test-cli and via --no-verify on the local commits/push. CI runs them on a faster runner without issue.
  • GitHub-runner kernel is not the user's reproducer. The runner doesn't exhibit the original bug at k3s init — it surfaces later at sandbox-image build (Docker build stream error: failed to export layer: CreateDiff: Canceled). The negative phase therefore reports SKIP on the canonical-error-string check (the install.sh non-zero exit still proves the auto-fix is load-bearing). The original reporter's exact docker info (Driver: overlayfs + io.containerd.snapshotter.v1 on Ubuntu 24.04 with kernel 6.17) is more representative of real users.
  • CodeRabbit feedback. Four review passes, all addressed; replies posted inline on each thread. Final standing items are pinned in the doc + tests so future regressions trip CI (e.g. Dockerfile.toContain('FROM ubuntu:24.04@sha256:'), does not contain "github.com", RUN apt-get / RUN curl forbidden in final stage).

Summary by CodeRabbit

  • New Features

    • Onboarding detects Docker 26+ nested-overlay conflicts and automatically builds/reuses a local patched cluster image (defaults to fuse-overlayfs); env vars allow opting out or selecting snapshotter. On failure, onboarding falls back to the upstream image with explanatory logs.
  • Documentation

    • Added prerequisites and troubleshooting guidance describing detection, the automatic workaround, override env knobs, and a manual daemon.json remediation.
  • Tests

    • Added unit tests, an E2E validation script, and a nightly E2E job covering auto-fix behavior and idempotency.

Signed-off-by: Aaron Erickson aerickson@nvidia.com

…r 26+ overlayfs nested mount break

Detect the Docker 26+ containerd-snapshotter overlayfs configuration in
preflight (Linux + Driver=overlayfs + DriverStatus mentions
io.containerd.snapshotter.v1) and transparently build a local
nemoclaw-cluster:VER-SNAPSHOTTER-SHA image that installs fuse-overlayfs
and overrides the k3s CMD with --snapshotter=fuse-overlayfs. This routes
around the kernel-level nested-overlay limitation that crashes k3s
inside the upstream cluster image on hosts that use Docker 26+'s default
containerd image store.

The patched image is built on the user's machine via docker build from
the upstream image plus a tiny Dockerfile fragment. We do not publish
the image; the cache is the user's local Docker engine. Hosts on the
legacy overlay2 driver, macOS Docker Desktop, and Linux installs that
disable the containerd image store see no behavioral change.

Override knobs:
- NEMOCLAW_DISABLE_OVERLAY_FIX=1 — skip the auto-fix and run against
  the unmodified upstream cluster image.
- NEMOCLAW_OVERLAY_SNAPSHOTTER=native — bake the k3s `native` snapshotter
  into the patched image instead of `fuse-overlayfs`. Native uses more
  disk but has no FUSE dependency.

Refs #2481.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c228a03b-00ac-43d9-afc1-21cbac715786

📥 Commits

Reviewing files that changed from the base of the PR and between a3305ec and 0757227.

📒 Files selected for processing (1)
  • test/e2e/test-overlayfs-autofix.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/test-overlayfs-autofix.sh

📝 Walkthrough

Walkthrough

Detects Linux Docker 26+ containerd snapshotter + overlayfs nested-overlay conflicts during onboarding and, when enabled, idempotently builds/uses a locally patched OpenShell cluster image (installing or selecting a compatible snapshotter). Honors env overrides and documents a manual /etc/docker/daemon.json remediation path.

Changes

Cohort / File(s) Summary
Documentation & References
docs/get-started/prerequisites.md, docs/reference/troubleshooting.md, .agents/skills/nemoclaw-user-get-started/references/prerequisites.md, .agents/skills/nemoclaw-user-reference/references/troubleshooting.md
Add notes and troubleshooting entries describing Docker 26+ containerd snapshotter overlayfs conflict, NemoClaw’s onboarding-time patched-cluster-image auto-fix, env var override knobs (NEMOCLAW_DISABLE_OVERLAY_FIX, NEMOCLAW_OVERLAY_SNAPSHOTTER), and an alternative manual /etc/docker/daemon.json workaround.
Cluster Image Patching Module & Tests
src/lib/cluster-image-patch.ts, src/lib/cluster-image-patch.test.ts
New module to generate patch Dockerfiles (install fuse-overlayfs or set native), extract upstream version, compute deterministic patched tags, and ensure patched image via docker image inspect/manifest inspect/pull/build with timeouts and errors; tests cover Dockerfile generation, parsing, caching, build/pull calls, idempotency, and failure paths.
Onboarding Integration
src/lib/onboard.ts
Integrates overlay auto-fix into gateway image selection: runs host preflight, respects NEMOCLAW_DISABLE_OVERLAY_FIX and NEMOCLAW_OVERLAY_SNAPSHOTTER, attempts ensurePatchedClusterImage() for an override tag, logs detection/build outcomes, and falls back to upstream image on opt-out, no conflict, or assessment/build failure.
Host Assessment & Tests
src/lib/preflight.ts, src/lib/preflight.test.ts
Add Docker docker info parsers (parseDockerStorageDriver, parseDockerUsesContainerdSnapshotter), extend HostAssessment with dockerStorageDriver, dockerUsesContainerdSnapshotter, and hasNestedOverlayConflict, and update tests to cover Linux-scoped conflict detection and parsing helpers.
E2E Test & CI
test/e2e/test-overlayfs-autofix.sh, .github/workflows/nightly-e2e.yaml
Add E2E script exercising autofix (positive/negative scenarios, idempotency, cleanup, daemon.json restore) and CI job overlayfs-autofix-e2e to run script nightly and upload logs on failure.

Sequence Diagram

sequenceDiagram
    participant User as User
    participant Onboard as onboard.ts
    participant Preflight as preflight.ts
    participant ClusterPatch as cluster-image-patch.ts
    participant Docker as Docker Daemon
    participant Cache as Local Image Cache

    User->>Onboard: start onboarding
    Onboard->>Preflight: assessHost()
    Preflight->>Docker: docker info
    Docker-->>Preflight: storage driver & status
    Preflight-->>Onboard: HostAssessment (hasNestedOverlayConflict: true)

    alt Conflict detected and fix enabled
        Onboard->>ClusterPatch: ensurePatchedClusterImage(upstream, snapshotter)
        ClusterPatch->>Cache: docker image inspect <patched-tag>
        alt cached
            Cache-->>ClusterPatch: image exists
        else not cached
            ClusterPatch->>Docker: docker manifest inspect / docker pull <upstream>
            Docker-->>ClusterPatch: upstream image/digest
            ClusterPatch->>ClusterPatch: generate Dockerfile (install fuse-overlayfs / set CMD)
            ClusterPatch->>Docker: docker build -> <patched-tag>
            Docker-->>ClusterPatch: built patched image
        end
        ClusterPatch-->>Onboard: patched-image-tag
    else No conflict or fix disabled
        Onboard-->>Onboard: use upstream image
    end

    Onboard-->>User: continue onboarding with chosen image
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

I’m a rabbit with a tiny spade,
I sniffed an overlay in the glade,
Built fuse patches, cached with care,
Hopped onboarding safe from snare,
🐇 Patch applied — we’re booting fair!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: auto-patching the cluster image to fix Docker 26+ overlayfs nested mount issues during onboarding.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/overlayfs-snapshotter-2481

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (3)
src/lib/preflight.ts (1)

186-189: Add a plain-text fallback in storage-driver parsing.

parseDockerUsesContainerdSnapshotter is resilient to both JSON and plain docker info text, but parseDockerStorageDriver currently only parses JSON. Adding a Storage Driver: fallback keeps hasNestedOverlayConflict detection robust if non-JSON output is ever passed in.

Suggested patch
 export function parseDockerStorageDriver(info = ""): string | undefined {
-  const driverMatch = info.match(/"Driver"\s*:\s*"([^"]+)"/);
-  return driverMatch?.[1];
+  const jsonDriver = info.match(/"Driver"\s*:\s*"([^"]+)"/)?.[1];
+  if (jsonDriver) return jsonDriver;
+  const textDriver = info.match(/^\s*Storage Driver:\s*(\S+)/m)?.[1];
+  return textDriver;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/preflight.ts` around lines 186 - 189, The parseDockerStorageDriver
function only extracts the driver from JSON; update parseDockerStorageDriver to
first try the existing JSON regex and if that fails fall back to a plain-text
match for lines like "Storage Driver: <value>" (e.g., a regexp capturing the
non-newline value after "Storage Driver:"). Return the captured value or
undefined as before; mirror the resilient approach used in
parseDockerUsesContainerdSnapshotter so hasNestedOverlayConflict works with
non-JSON docker info.
docs/reference/troubleshooting.md (1)

242-242: Reduce em-dash usage in this paragraph (LLM pattern detected).

This line uses multiple em dashes in one paragraph; prefer periods or commas for clarity.

As per coding guidelines, "Excessive em dashes. One per paragraph is fine; multiple per paragraph ... should be flagged."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/reference/troubleshooting.md` at line 242, The sentence in
troubleshooting.md uses multiple em dashes; rewrite the sentence to remove extra
em dashes (keep at most one) by splitting into two sentences or using
commas/parentheses for clarity—e.g., replace "The same hardware running Docker
25 or earlier — or any Docker version with the containerd image store disabled —
uses the legacy `overlay2` driver and is unaffected." with a version that uses a
comma or period to separate clauses so only one em dash remains (edit the exact
text in docs/reference/troubleshooting.md).
src/lib/onboard.ts (1)

3613-3616: Keep getGatewayStartEnv() free of build side effects.

Calling applyOverlayfsAutoFix() here means a helper that sounds like pure env assembly now probes Docker and may build a local image. That makes the function surprisingly expensive and harder to reuse safely. I’d resolve the overlay override in the gateway-start path and keep this helper focused on serializing the already-chosen env values.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 3613 - 3616, getGatewayStartEnv() currently
calls applyOverlayfsAutoFix(stableGatewayImage) which introduces build/probing
side effects; remove that call so getGatewayStartEnv() only assembles/serializes
env values. Resolve overlayOverride earlier in the gateway-start path (call
applyOverlayfsAutoFix(stableGatewayImage) where the process actually decides the
image), then either set gatewayEnv.OPENSHELL_CLUSTER_IMAGE before invoking
getGatewayStartEnv() or add an explicit parameter to
getGatewayStartEnv(overlayOverride) so the helper uses the already-resolved
value and does not perform Docker/build work itself.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/get-started/prerequisites.md`:
- Line 51: The paragraph containing the sentence about Linux hosts, Docker 26+,
the containerd image store, and how `nemoclaw onboard` builds a
`fuse-overlayfs`-enabled cluster image should be split so each sentence is on
its own source line; locate the line containing the phrases "Linux hosts running
Docker 26", "containerd image store", "`nemoclaw onboard` transparently builds",
"bypass a kernel-level nested-overlay limitation in k3s", and "See the
troubleshooting guide" and break it into separate lines (one sentence per line)
while preserving exact wording and links.

In `@docs/reference/troubleshooting.md`:
- Around line 234-247: The new troubleshooting section has multiple sentences on
single source lines; reflow each sentence onto its own line so each line
contains exactly one sentence, preserving original wording and punctuation;
specifically split the paragraph that starts with "Docker Engine 26 and later
default fresh installations..." and the paragraph that begins "NemoClaw detects
the Docker 26+ containerd-snapshotter..." (also apply the same
one-sentence-per-line fix to the later small paragraph referenced around "The
auto-fix runs once per OpenShell version..." and the note about hosts without
the conflict) so diffs are readable and no sentence remains merged on the same
source line.

In `@src/lib/cluster-image-patch.ts`:
- Around line 78-82: The current extractUpstreamVersion treats a colon anywhere
as a tag (so registry:5000/repo yields "5000/repo"); update
extractUpstreamVersion to remove any digest (after "@") then find the last "/"
index and only consider a colon that appears after that last "/" as a tag
delimiter: if a colon exists after the last slash return the substring after
that colon, otherwise return "unknown". Reference the extractUpstreamVersion
function and variables like afterAt/afterColon when modifying the parsing logic.

In `@src/lib/onboard.ts`:
- Around line 3645-3647: The code silently maps any unknown
NEMOCLAW_OVERLAY_SNAPSHOTTER string to "fuse-overlayfs"; update the logic around
requestedSnapshotter and snapshotter to validate the env var: if the env var is
empty keep the existing default ("fuse-overlayfs"), but if non-empty only accept
the exact allowed values "native" or "fuse-overlayfs" (consider normalizing
whitespace/casing first), and if the value is invalid emit a clear error (or
process exit) instead of silently falling back. Locate the lines where
requestedSnapshotter and snapshotter are defined and replace the permissive
fallback with explicit validation that throws or logs and exits on invalid input
so typos like "NATIVE" or "fuse" don't silently change behavior.

---

Nitpick comments:
In `@docs/reference/troubleshooting.md`:
- Line 242: The sentence in troubleshooting.md uses multiple em dashes; rewrite
the sentence to remove extra em dashes (keep at most one) by splitting into two
sentences or using commas/parentheses for clarity—e.g., replace "The same
hardware running Docker 25 or earlier — or any Docker version with the
containerd image store disabled — uses the legacy `overlay2` driver and is
unaffected." with a version that uses a comma or period to separate clauses so
only one em dash remains (edit the exact text in
docs/reference/troubleshooting.md).

In `@src/lib/onboard.ts`:
- Around line 3613-3616: getGatewayStartEnv() currently calls
applyOverlayfsAutoFix(stableGatewayImage) which introduces build/probing side
effects; remove that call so getGatewayStartEnv() only assembles/serializes env
values. Resolve overlayOverride earlier in the gateway-start path (call
applyOverlayfsAutoFix(stableGatewayImage) where the process actually decides the
image), then either set gatewayEnv.OPENSHELL_CLUSTER_IMAGE before invoking
getGatewayStartEnv() or add an explicit parameter to
getGatewayStartEnv(overlayOverride) so the helper uses the already-resolved
value and does not perform Docker/build work itself.

In `@src/lib/preflight.ts`:
- Around line 186-189: The parseDockerStorageDriver function only extracts the
driver from JSON; update parseDockerStorageDriver to first try the existing JSON
regex and if that fails fall back to a plain-text match for lines like "Storage
Driver: <value>" (e.g., a regexp capturing the non-newline value after "Storage
Driver:"). Return the captured value or undefined as before; mirror the
resilient approach used in parseDockerUsesContainerdSnapshotter so
hasNestedOverlayConflict works with non-JSON docker info.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2fdf7f3c-a409-4c3b-9918-cc8c205b6368

📥 Commits

Reviewing files that changed from the base of the PR and between f439189 and 750a46e.

📒 Files selected for processing (9)
  • .agents/skills/nemoclaw-user-get-started/references/prerequisites.md
  • .agents/skills/nemoclaw-user-reference/references/troubleshooting.md
  • docs/get-started/prerequisites.md
  • docs/reference/troubleshooting.md
  • src/lib/cluster-image-patch.test.ts
  • src/lib/cluster-image-patch.ts
  • src/lib/onboard.ts
  • src/lib/preflight.test.ts
  • src/lib/preflight.ts

Comment thread docs/get-started/prerequisites.md Outdated
Comment thread docs/reference/troubleshooting.md Outdated
Comment thread src/lib/cluster-image-patch.ts
Comment thread src/lib/onboard.ts Outdated
Add test/e2e/test-overlayfs-autofix.sh and a matching nightly-e2e
workflow job that:

1. Force the GitHub runner's Docker daemon onto the containerd image
   store via /etc/docker/daemon.json (`features.containerd-snapshotter:
   true`), restart Docker, and verify Driver=overlayfs +
   io.containerd.snapshotter.v1 — the exact bug-triggering config from
   the user report.
2. Install + onboard NemoClaw and assert the auto-fix triggered
   (detection log line present, nemoclaw-cluster:*-fuse-overlayfs-*
   image built, gateway running on the patched image, cluster log
   clean of the canonical k3s nested-overlay error).
3. Re-run onboard and assert the patched image was reused (Created
   timestamp unchanged) — proves the cache path is hit.
4. Re-run with NEMOCLAW_DISABLE_OVERLAY_FIX=1 under a bounded
   `timeout 300` and assert k3s logs the canonical
   `overlayfs snapshotter cannot be enabled` error — proves the
   auto-fix is the load-bearing piece, not coincidence.
5. Always revert daemon.json + restart Docker on EXIT so a failed
   test doesn't leave the runner in a degraded state.

The script auto-skips on hosts whose Docker version doesn't support
the feature flag or doesn't actually flip to overlayfs after the
restart.

This test is **temporary**. The header comments and the workflow
job comments call out removal criteria: delete the script, the job,
and the matching notify-on-failure entry in the same PR that deletes
src/lib/cluster-image-patch.ts (after OpenShell roadmap #873 lands a
non-k3s sandbox driver).

Refs #2481.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-overlayfs-autofix.sh`:
- Around line 324-337: The test allows an empty patched_tag which can make the
later gateway_image equality check falsely pass when both are empty; update the
flow in test-overlayfs-autofix.sh so that after computing patched_tag you guard
the following assertions: if patched_tag is empty then immediately call fail (or
exit) and do not evaluate the gateway_image check (the block that inspects
GATEWAY_CONTAINER and compares gateway_image to patched_tag). Apply the same
guarding pattern to the similar block around lines 360-383 so you never compare
gateway_image (or other resources) against an empty patched_tag.
- Around line 109-120: The cleanup in revert_daemon_config can copy an invalid
sentinel into DAEMON_JSON and restart Docker, leaving Docker broken; update the
logic so you never write an invalid sentinel to DAEMON_JSON — either (a) use a
valid JSON sentinel (e.g., {}) instead of "/* none */", or (b) before copying
DAEMON_JSON_BACKUP into DAEMON_JSON validate the backup is valid JSON and only
restore if valid; additionally when handling the branch that removes the
containerd-snapshotter flag (the elif that checks for '"containerd-snapshotter":
*true') avoid simply removing DAEMON_JSON unconditionally — instead parse/edit
the JSON to delete just the flag or recreate a valid minimal JSON file so the
restarted docker gets valid JSON. Ensure references: revert_daemon_config,
DAEMON_JSON_BACKUP, DAEMON_JSON, and the containerd-snapshotter removal branch
are updated accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4f45cb18-365b-4d7b-a06d-eeb3e706dc92

📥 Commits

Reviewing files that changed from the base of the PR and between 750a46e and 2a53572.

📒 Files selected for processing (2)
  • .github/workflows/nightly-e2e.yaml
  • test/e2e/test-overlayfs-autofix.sh

Comment thread test/e2e/test-overlayfs-autofix.sh
Comment thread test/e2e/test-overlayfs-autofix.sh
@ericksoa ericksoa changed the title fix(onboard): auto-build fuse-overlayfs cluster image to bypass Docker 26+ overlayfs nested mount break (#2481) fix(onboard): auto-patch cluster image for Docker 26+ overlayfs nested mount break Apr 26, 2026
- src/lib/cluster-image-patch.ts: extractUpstreamVersion now treats a
  colon as a tag delimiter only when it appears after the last path
  separator, so registry-port references like
  registry.example.com:5000/openshell/cluster no longer parse the port
  + repo path as a tag. Three new unit tests cover the registry-port
  case, the registry-port-plus-tag case, and the trailing-colon case.

- src/lib/onboard.ts: validate NEMOCLAW_OVERLAY_SNAPSHOTTER. Empty
  keeps the fuse-overlayfs default; "native" or "fuse-overlayfs"
  (case-insensitive after trim) selects that mode; anything else logs
  a warning and falls back to fuse-overlayfs so typos like NATIVE or
  fuse don't silently change behavior.

- test/e2e/test-overlayfs-autofix.sh:
  * Replace the "/* none */" sentinel inside the daemon.json backup
    file with a separate absent-marker file, so revert never writes
    invalid JSON into /etc/docker/daemon.json. Revert checks the
    marker first and removes the file outright when no original
    existed; the legacy "remove our edit" branch is no longer needed
    and was deleted.
  * Guard the gateway-image equality check, the cluster-log scan, and
    the idempotency timestamp comparison so they only run when phase
    3 actually produced a patched_tag — preventing false PASSes on
    empty == empty comparisons. Idempotency now reports SKIP in that
    degraded case rather than a misleading PASS or FAIL.

- docs/get-started/prerequisites.md, docs/reference/troubleshooting.md:
  Reflow the new content to one sentence per line per the project
  doc style guide.

Refs #2481.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3668-3672: The patched-image build path can hang because
ensurePatchedClusterImage currently runs docker pull/build synchronously without
timeout; update ensurePatchedClusterImage to enforce a hard timeout for external
commands (docker pull and docker build) — either reuse a child-process API that
supports timeouts (e.g., execFile/exec with the timeout option or a spawn
wrapper that kills the child after N ms) or run the spawn child and attach a
timer to kill it and reject on timeout; ensure the timeout triggers a clean kill
and throws an error that applyOverlayfsAutoFix can catch so onboarding won't
hang indefinitely.

In `@test/e2e/test-overlayfs-autofix.sh`:
- Around line 421-430: Save the current errexit state before entering the
negative phase and restore it afterward instead of unconditionally running set
-e; specifically, before running the block that sets
NEMOCLAW_DISABLE_OVERLAY_FIX and runs timeout ... install.sh (the code that
assigns negative_exit), capture the current errexit setting (e.g., store whether
"set -e" is active in a variable), then change the errexit behavior only as
needed for the negative test, and after assigning negative_exit restore the
original state so subsequent calls to pass() and fail() (which use ((PASS++)) /
((FAIL++))) and the final summary logic run with the original errexit behavior.
Ensure you reference and preserve the negative_exit handling and do not alter
pass()/fail() implementations.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4e227a0f-9d04-48cd-8282-996acfd4f9bf

📥 Commits

Reviewing files that changed from the base of the PR and between 2a53572 and 4796ad5.

📒 Files selected for processing (8)
  • .agents/skills/nemoclaw-user-get-started/references/prerequisites.md
  • .agents/skills/nemoclaw-user-reference/references/troubleshooting.md
  • docs/get-started/prerequisites.md
  • docs/reference/troubleshooting.md
  • src/lib/cluster-image-patch.test.ts
  • src/lib/cluster-image-patch.ts
  • src/lib/onboard.ts
  • test/e2e/test-overlayfs-autofix.sh
✅ Files skipped from review due to trivial changes (3)
  • docs/get-started/prerequisites.md
  • .agents/skills/nemoclaw-user-get-started/references/prerequisites.md
  • docs/reference/troubleshooting.md

Comment thread src/lib/onboard.ts
Comment thread test/e2e/test-overlayfs-autofix.sh Outdated
- src/lib/cluster-image-patch.ts: add wall-clock timeouts to docker
  pull (10 min default) and docker build (5 min default) via the
  spawnSync `timeout` option, threaded through runner.run/runCapture.
  Without this, a hung daemon or stalled registry on the patched-image
  build path would hang the entire onboard before any of OpenShell's
  gateway-start timeouts could trigger. New env-injectable opts
  pullTimeoutMs and buildTimeoutMs allow tests + ops to override.

- test/e2e/test-overlayfs-autofix.sh: drop the unsafe set +e / set -e
  toggle around the negative-phase install.sh invocation. The script
  header is `set -uo pipefail` (errexit OFF), so the timeout's non-
  zero exit never threatened us in the first place; meanwhile the
  trailing `set -e` would have enabled errexit for the rest of the
  script and made `((PASS++))` fatal under any zero-valued counter
  due to bash post-increment semantics.

Refs #2481.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
test/e2e/test-overlayfs-autofix.sh (1)

338-356: ⚠️ Potential issue | 🟠 Major

Don't treat a missing gateway container as a clean log.

If the positive phase never creates $GATEWAY_CONTAINER, Line 352 reads daemon error text from docker logs; grep -q misses it, and the script reports a PASS. Move the log check under the same presence guard and fail when the container is absent. This is the same guard gap that was flagged earlier, just on the remaining assertion.

💡 Proposed fix
-# Cluster log must NOT carry the original error string.
-if docker logs "$GATEWAY_CONTAINER" 2>&1 | grep -q "overlayfs.*snapshotter cannot be enabled"; then
-  fail "Cluster log still contains the nested-overlay error after auto-fix"
-else
-  pass "Cluster log clean of the nested-overlay error"
-fi
+# Cluster log must NOT carry the original error string.
+if [ -z "$patched_tag" ]; then
+  skip "Cluster log check skipped (no patched image from phase 3)"
+elif docker ps -a --format '{{.Names}}' | grep -q "^${GATEWAY_CONTAINER}$"; then
+  if docker logs "$GATEWAY_CONTAINER" 2>&1 | grep -q "overlayfs.*snapshotter cannot be enabled"; then
+    fail "Cluster log still contains the nested-overlay error after auto-fix"
+  else
+    pass "Cluster log clean of the nested-overlay error"
+  fi
+else
+  fail "Gateway container '$GATEWAY_CONTAINER' was not found after the positive phase"
+fi
#!/bin/bash
set -euo pipefail

file="test/e2e/test-overlayfs-autofix.sh"

echo "== current positive-phase gateway assertions =="
nl -ba "$file" | sed -n '331,356p'

echo
echo "== non-matching daemon error demo =="
if printf '%s\n' 'Error response from daemon: No such container: openshell-cluster-nemoclaw' | \
  grep -q 'overlayfs.*snapshotter cannot be enabled'; then
  echo "matched"
else
  echo "no match -> current script would take the PASS branch"
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-overlayfs-autofix.sh` around lines 338 - 356, The
log-cleanliness check currently runs unguarded and will falsely PASS if the
gateway container ($GATEWAY_CONTAINER) was never created; move the docker logs /
grep block inside the same "if [ -n \"$patched_tag\" ]; then" presence guard
(the block that computes gateway_image and compares it to patched_tag) so the
script only inspects logs when the container exists, and if the container is
absent fail explicitly; reference the variables/commands patched_tag,
GATEWAY_CONTAINER, gateway_image, and the docker logs | grep -q
"overlayfs.*snapshotter cannot be enabled" pipeline when updating the control
flow.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/cluster-image-patch.ts`:
- Around line 81-97: buildPatchDockerfile currently always adds "fuse-overlayfs"
to the apt-get install list even when snapshotter === "native"; change
buildPatchDockerfile to conditionally include the fuse-overlayfs install only
when snapshotter is not "native" (i.e., modify the array/strings built in
buildPatchDockerfile to omit the package and trailing backslash when snapshotter
=== "native"), and add/adjust tests to assert that
buildPatchDockerfile("native") does NOT contain "fuse-overlayfs" while
buildPatchDockerfile("containerd" or other) does contain it; reference the
buildPatchDockerfile function and the snapshotter parameter when making this
change.

In `@test/e2e/test-overlayfs-autofix.sh`:
- Around line 285-290: The positive and idempotency phases call env but inherit
the caller's environment, which lets pre-set variables like
NEMOCLAW_DISABLE_OVERLAY_FIX or NEMOCLAW_OVERLAY_SNAPSHOTTER leak into the test;
update the env invocations used for the "positive" and "idempotency/replay"
phases (the two blocks invoking env ... bash install.sh --non-interactive
>"$INSTALL_LOG" 2>&1 &) to be hermetic by using a clean environment (e.g., env
-i) and explicitly set the required variables (NEMOCLAW_NON_INTERACTIVE=1
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 NEMOCLAW_SANDBOX_NAME="$SANDBOX_NAME"
NEMOCLAW_RECREATE_SANDBOX=1 and explicitly set NEMOCLAW_DISABLE_OVERLAY_FIX=0
and NEMOCLAW_OVERLAY_SNAPSHOTTER=fuse-overlayfs) so the test always exercises
the fuse-overlayfs path; apply the same change to the second env block mentioned
in the comment.

---

Duplicate comments:
In `@test/e2e/test-overlayfs-autofix.sh`:
- Around line 338-356: The log-cleanliness check currently runs unguarded and
will falsely PASS if the gateway container ($GATEWAY_CONTAINER) was never
created; move the docker logs / grep block inside the same "if [ -n
\"$patched_tag\" ]; then" presence guard (the block that computes gateway_image
and compares it to patched_tag) so the script only inspects logs when the
container exists, and if the container is absent fail explicitly; reference the
variables/commands patched_tag, GATEWAY_CONTAINER, gateway_image, and the docker
logs | grep -q "overlayfs.*snapshotter cannot be enabled" pipeline when updating
the control flow.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 22001824-b875-4c64-b317-de077617429c

📥 Commits

Reviewing files that changed from the base of the PR and between 4796ad5 and 86b5789.

📒 Files selected for processing (2)
  • src/lib/cluster-image-patch.ts
  • test/e2e/test-overlayfs-autofix.sh

Comment thread src/lib/cluster-image-patch.ts
Comment thread test/e2e/test-overlayfs-autofix.sh Outdated
The first nightly e2e run on PR #2487 surfaced two real issues with the
patched-image build path; both fixed here.

1. apt-get install fuse-overlayfs failed with `dpkg-deb` invoking
   `tar --warning=no-timestamp`, which the upstream cluster image's
   BusyBox tar does not understand. Result: any apt-installed package
   fails to extract during the patched-image build, and
   `applyOverlayfsAutoFix` falls back to the unmodified upstream image.
   The negative-test guard ("Cluster log clean of the nested-overlay
   error") still passed because the runner's kernel happens to support
   k3s's nested overlay at startup, masking the underlying regression
   until the sandbox-image build (which does NOT support it) failed
   later in the negative phase with `CreateDiff: Canceled`.

   Fix: download the prebuilt static `fuse-overlayfs` binary directly
   from `containers/fuse-overlayfs` (pinned at v1.14, mapped from
   TARGETARCH amd64/arm64 to upstream naming x86_64/aarch64). Static
   binary bundles libfuse, so we don't need libfuse3-3 in the base
   image. No apt, no dpkg, no tar dependency.

2. Negative-phase assertion in test-overlayfs-autofix.sh required the
   k3s startup error string to appear verbatim. On the GitHub Actions
   ubuntu-latest runner the same root cause surfaces later in the flow
   (sandbox image build) instead of at k3s init, so the canonical
   string is absent there. The non-zero install.sh exit already proves
   "without the auto-fix, things don't work" — soften the secondary
   check from FAIL to SKIP and let the artifact diff reveal which
   downstream error actually fired.

Two new unit tests assert the Dockerfile no longer references apt and
that the TARGETARCH mapping is in place.

Refs #2481.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lib/cluster-image-patch.ts (1)

109-121: ⚠️ Potential issue | 🟠 Major

Avoid coupling native snapshotter mode to fuse binary download/install.

buildPatchDockerfile("native") still executes the fuse-overlayfs download path. That makes the native override fail on restricted/offline environments even though native mode does not require fuse-overlayfs.

Suggested patch
 export function buildPatchDockerfile(snapshotter: SnapshotterChoice): string {
+  const installFuseOverlayfs =
+    snapshotter === "fuse-overlayfs"
+      ? [
+          "ARG TARGETARCH=amd64",
+          `ARG FUSE_OVERLAYFS_VERSION=${FUSE_OVERLAYFS_VERSION}`,
+          'RUN set -eux; \\',
+          '    case "$TARGETARCH" in \\',
+          "      amd64) FUSE_ARCH=x86_64 ;; \\",
+          "      arm64) FUSE_ARCH=aarch64 ;; \\",
+          '      *) echo "Unsupported TARGETARCH: $TARGETARCH" >&2; exit 1 ;; \\',
+          "    esac; \\",
+          '    curl -fsSL "https://github.com/containers/fuse-overlayfs/releases/download/${FUSE_OVERLAYFS_VERSION}/fuse-overlayfs-${FUSE_ARCH}" \\',
+          "      -o /usr/local/bin/fuse-overlayfs; \\",
+          "    chmod +x /usr/local/bin/fuse-overlayfs; \\",
+          "    /usr/local/bin/fuse-overlayfs --version",
+        ]
+      : [];
+
   return [
     "# Generated by NemoClaw — see src/lib/cluster-image-patch.ts.",
     "ARG UPSTREAM",
     "FROM ${UPSTREAM}",
     "USER root",
-    "ARG TARGETARCH=amd64",
-    `ARG FUSE_OVERLAYFS_VERSION=${FUSE_OVERLAYFS_VERSION}`,
-    'RUN set -eux; \\',
-    '    case "$TARGETARCH" in \\',
-    "      amd64) FUSE_ARCH=x86_64 ;; \\",
-    "      arm64) FUSE_ARCH=aarch64 ;; \\",
-    '      *) echo "Unsupported TARGETARCH: $TARGETARCH" >&2; exit 1 ;; \\',
-    "    esac; \\",
-    '    curl -fsSL "https://github.com/containers/fuse-overlayfs/releases/download/${FUSE_OVERLAYFS_VERSION}/fuse-overlayfs-${FUSE_ARCH}" \\',
-    "      -o /usr/local/bin/fuse-overlayfs; \\",
-    "    chmod +x /usr/local/bin/fuse-overlayfs; \\",
-    "    /usr/local/bin/fuse-overlayfs --version",
+    ...installFuseOverlayfs,
     `CMD ["server", "--snapshotter=${snapshotter}"]`,
     "",
   ].join("\n");
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/cluster-image-patch.ts` around lines 109 - 121, The Dockerfile patch
currently always inserts the fuse-overlayfs download/install block even when
buildPatchDockerfile is invoked with snapshotter="native", causing failures in
restricted/offline environments; update buildPatchDockerfile to only emit the
ARG FUSE_OVERLAYFS_VERSION plus the curl/chmod/version lines when the
snapshotter variable is not "native" (i.e., guard the fuse-overlayfs block on
snapshotter !== "native"), keeping the ARG TARGETARCH and CMD lines intact, so
that buildPatchDockerfile("native") skips any fuse-overlayfs download/install
steps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/cluster-image-patch.ts`:
- Around line 117-118: The hardcoded GitHub URL in the curl command string (the
literal containing
"https://github.com/containers/fuse-overlayfs/releases/download/${FUSE_OVERLAYFS_VERSION}/fuse-overlayfs-${FUSE_ARCH}"
that downloads to /usr/local/bin/fuse-overlayfs) must be removed and replaced
with a policy-compliant source; update the string in
src/lib/cluster-image-patch.ts to point to the approved/internal artifact
repository or package manager endpoint your org uses (preserving
FUSE_OVERLAYFS_VERSION and FUSE_ARCH interpolation), or switch the download to
use an internal proxy function that resolves artifacts, and ensure any related
config/env references are updated to use the internal artifact location.

---

Duplicate comments:
In `@src/lib/cluster-image-patch.ts`:
- Around line 109-121: The Dockerfile patch currently always inserts the
fuse-overlayfs download/install block even when buildPatchDockerfile is invoked
with snapshotter="native", causing failures in restricted/offline environments;
update buildPatchDockerfile to only emit the ARG FUSE_OVERLAYFS_VERSION plus the
curl/chmod/version lines when the snapshotter variable is not "native" (i.e.,
guard the fuse-overlayfs block on snapshotter !== "native"), keeping the ARG
TARGETARCH and CMD lines intact, so that buildPatchDockerfile("native") skips
any fuse-overlayfs download/install steps.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 97797bf1-6a09-4411-9cfe-4d449c62cf71

📥 Commits

Reviewing files that changed from the base of the PR and between 86b5789 and cfd49f6.

📒 Files selected for processing (3)
  • src/lib/cluster-image-patch.test.ts
  • src/lib/cluster-image-patch.ts
  • test/e2e/test-overlayfs-autofix.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/test-overlayfs-autofix.sh

Comment thread src/lib/cluster-image-patch.ts Outdated
Second nightly e2e (run 24963030327) showed the previous attempt at
fetching the static binary in-place still failed: the upstream cluster
image not only lacks GNU tar (so apt is broken) but also lacks curl
(so `RUN curl …` exits 127). The upstream Dockerfile pulls curl in
throwaway builder stages exactly to avoid bloating the final image,
and never copies it forward.

Switch to a multi-stage build:

  FROM alpine:3.20 AS bin-fetcher
    apk add curl ca-certificates
    curl -fsSL <upstream releases URL> -o /tmp/fuse-overlayfs
    chmod +x /tmp/fuse-overlayfs

  FROM ${UPSTREAM}
    COPY --from=bin-fetcher /tmp/fuse-overlayfs /usr/local/bin/

Alpine's ~5 MB base + apk is fast and reliable. The static binary
bundles libfuse, so the cluster image gains nothing from apt and we
add zero RUN steps to the final stage.

Verified locally that the bin-fetcher stage builds cleanly and the
binary reports its version (`fuse-overlayfs --version` succeeds).

Test updates:
- Assert the multi-stage shape (alpine builder + COPY --from).
- Replace the apt-only assertion with a stricter "no RUN apt-get OR
  curl in the final stage" check; Reverting to either path will trip
  the e2e via install logs.

Refs #2481.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Three findings:

1. Hardcoded link to a third-party GitHub repository (containers/fuse-
   overlayfs releases) in TS source violated the repo "No External
   Project Links" policy. Replaced with `apt-get install fuse-overlayfs`
   from a clean ubuntu:24.04 builder stage — Ubuntu's archive is an
   officially-depended-on tool. The bin-fetcher stage stages the binary
   plus its sole non-libc shared dep (libfuse3.so.3) under /export, and
   the final stage `COPY --from`s both into the cluster image and runs
   ldconfig.

   Verified locally: full multi-stage build with `ubuntu:24.04` as the
   UPSTREAM stand-in produces an image where `fuse-overlayfs --version`
   reports `fuse-overlayfs: version 1.13-dev / FUSE library version
   3.14.0`. New unit test asserts the Dockerfile contains no github.com
   link.

2. `buildPatchDockerfile("native")` was unconditionally installing
   fuse-overlayfs — wasted work and unnecessary apt traffic for users
   who picked the `native` snapshotter explicitly. The native branch is
   now a 4-line Dockerfile (FROM upstream + CMD override) with no
   builder stage. Test asserts no fuse-overlayfs / apt-get / ubuntu
   strings appear in the native Dockerfile.

3. test-overlayfs-autofix.sh phase 3 (positive) and phase 4 (replay)
   used bare `env … bash install.sh`, which inherits any
   NEMOCLAW_DISABLE_OVERLAY_FIX or NEMOCLAW_OVERLAY_SNAPSHOTTER the
   caller exported. If the caller had either set, the positive-phase
   assertions on lines 325-345 (which require the fuse-overlayfs path)
   would silently break. Added `env -u …` to both invocations so the
   override knobs are explicitly stripped before install.sh runs.

Refs #2481.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/lib/cluster-image-patch.test.ts (1)

4-11: Import tests from source instead of dist artifacts.

Line 11 currently targets ../../dist/lib/cluster-image-patch. That can make tests pass against stale compiled output rather than the current TS source under review.

Suggested change
-import {
+import {
   buildPatchDockerfile,
   ClusterImagePatchError,
   computePatchedTag,
   ensurePatchedClusterImage,
   extractUpstreamVersion,
-} from "../../dist/lib/cluster-image-patch";
+} from "./cluster-image-patch";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/cluster-image-patch.test.ts` around lines 4 - 11, Tests are importing
from the compiled dist bundle which can hide source changes; update the import
path to point at the TypeScript source module (replace the
"../../dist/lib/cluster-image-patch" import with the source module path, e.g.
"../../src/lib/cluster-image-patch") so the tests load the current
implementations of buildPatchDockerfile, ClusterImagePatchError,
computePatchedTag, ensurePatchedClusterImage, and extractUpstreamVersion instead
of stale compiled artifacts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/cluster-image-patch.ts`:
- Around line 198-200: The probe path that calls `docker image inspect` is using
`runCaptureImpl` without a timeout which can hang; update the probe logic in the
`imageExists`/probe code to pass the `timeoutMs` option (or use the same timeout
variable used for pull/build) into the `runCaptureImpl` invocation that executes
`docker image inspect` (the probe branch referenced around the `imageExists(tag,
runCaptureImpl)` usage and the subsequent `docker image inspect` calls), and
ensure both probe occurrences (the one around lines 253–260 and the one around
257–259) include this timeout so a stuck Docker daemon cannot block onboarding.
- Around line 115-119: The Dockerfile template currently uses a floating base
image string "FROM ubuntu:24.04 AS bin-fetcher" and installs an unpinned
"fuse-overlayfs", which breaks the deterministic tag computed later (lines
153–166); change the template so the builder base is pinned to a specific digest
(replace the floating "ubuntu:24.04" token with a pinned digest value or an
interpolated constant) and make the fuse-overlayfs package version configurable
via an environment variable (e.g., FUSE_OVERLAYFS_VERSION) so the apt-get
install line in the template uses that pinned version when provided; update any
code that renders the Dockerfile string in src/lib/cluster-image-patch.ts to
accept and substitute these new inputs while preserving the existing
deterministic tag computation.

---

Nitpick comments:
In `@src/lib/cluster-image-patch.test.ts`:
- Around line 4-11: Tests are importing from the compiled dist bundle which can
hide source changes; update the import path to point at the TypeScript source
module (replace the "../../dist/lib/cluster-image-patch" import with the source
module path, e.g. "../../src/lib/cluster-image-patch") so the tests load the
current implementations of buildPatchDockerfile, ClusterImagePatchError,
computePatchedTag, ensurePatchedClusterImage, and extractUpstreamVersion instead
of stale compiled artifacts.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d57d3eb0-02a4-4e4b-92af-59a5a07eb653

📥 Commits

Reviewing files that changed from the base of the PR and between 86b5789 and 6df1dea.

📒 Files selected for processing (3)
  • src/lib/cluster-image-patch.test.ts
  • src/lib/cluster-image-patch.ts
  • test/e2e/test-overlayfs-autofix.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/test-overlayfs-autofix.sh

Comment thread src/lib/cluster-image-patch.ts Outdated
Comment thread src/lib/cluster-image-patch.ts Outdated
The fourth nightly e2e run (24964304092) showed the auto-fix end-to-end
working as designed:

  PASS: install.sh + onboard completed (exit 0)
  PASS: Onboard log contains the auto-fix detection message
  PASS: Patched cluster image present: nemoclaw-cluster:0.0.36-fuse-overlayfs-ab391cf9
  PASS: Gateway container is running the patched image
  PASS: Cluster log clean of the nested-overlay error

The two FAILs (Phase 4) were both downstream of an unrelated issue:
calling install.sh a second time forced a fresh OpenClaw sandbox-image
build, which is independently flaky on GitHub-Actions runner kernels
because of nested-overlay limitations in the inner buildkit. That's
the same kernel issue that causes the negative phase's install.sh to
fail (which we already accept as expected). It has nothing to do with
the patched-image cache hit we actually want to validate here.

Restructure Phase 4 so it tests the narrower invariant directly:

  cd $REPO_ROOT
  node -e 'require("./dist/lib/cluster-image-patch")
            .ensurePatchedClusterImage({ upstreamImage })' >/dev/null

  # assert returned tag == phase 3's patched_tag
  # assert docker inspect Created timestamp unchanged

The upstream image tag is derived from patched_tag's encoded version
component. No install.sh round-trip, no sandbox build, no flakiness.

Drops the now-unused ONBOARD_LOG_REPLAY constant + its workflow
artifact upload.

Refs #2481.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
test/e2e/test-overlayfs-autofix.sh (1)

359-364: ⚠️ Potential issue | 🟡 Minor

Don’t mark the positive-path log check as PASS when no gateway logs exist.

This assertion still runs even when the gateway container is missing, and docker logs failure drops into the else branch, producing a false PASS. Guard it the same way as the image check: skip when patched_tag is empty, and fail if the container is absent.

🧪 Tighten the guard
-# Cluster log must NOT carry the original error string.
-if docker logs "$GATEWAY_CONTAINER" 2>&1 | grep -q "overlayfs.*snapshotter cannot be enabled"; then
-  fail "Cluster log still contains the nested-overlay error after auto-fix"
-else
-  pass "Cluster log clean of the nested-overlay error"
-fi
+# Cluster log must NOT carry the original error string.
+if [ -z "$patched_tag" ]; then
+  skip "Cluster log check skipped (no patched image from phase 3)"
+elif docker ps -a --format '{{.Names}}' | grep -qx "$GATEWAY_CONTAINER"; then
+  if docker logs "$GATEWAY_CONTAINER" 2>&1 | grep -q "overlayfs.*snapshotter cannot be enabled"; then
+    fail "Cluster log still contains the nested-overlay error after auto-fix"
+  else
+    pass "Cluster log clean of the nested-overlay error"
+  fi
+else
+  fail "Gateway container '$GATEWAY_CONTAINER' is missing; cannot verify cluster logs"
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-overlayfs-autofix.sh` around lines 359 - 364, The positive-path
gateway log check should be guarded like the image check: first skip when
patched_tag is empty, then verify the gateway container exists and fail if it
does not; only then run docker logs for GATEWAY_CONTAINER and assert absence of
"overlayfs.*snapshotter cannot be enabled". Modify the block that references
GATEWAY_CONTAINER to check patched_tag (skip when empty), ensure the container
is present (fail if missing), and only then perform the docker logs | grep check
to decide fail/pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-overlayfs-autofix.sh`:
- Around line 97-98: The current DAEMON_JSON_BACKUP and
DAEMON_JSON_ABSENT_MARKER use predictable /tmp paths which can be symlinked;
change the script to create a private temp directory with mktemp -d (e.g. assign
to TMP_WORKDIR or E2E_TMPDIR) at startup, then set
DAEMON_JSON_BACKUP="$TMP_WORKDIR/nemoclaw-e2e-daemon.json.bak" and
DAEMON_JSON_ABSENT_MARKER="$TMP_WORKDIR/nemoclaw-e2e-daemon.json.absent", use
those variables wherever the script copies /etc/docker/daemon.json, and add a
trap to securely remove the temp dir on exit to avoid leaving state behind.

---

Duplicate comments:
In `@test/e2e/test-overlayfs-autofix.sh`:
- Around line 359-364: The positive-path gateway log check should be guarded
like the image check: first skip when patched_tag is empty, then verify the
gateway container exists and fail if it does not; only then run docker logs for
GATEWAY_CONTAINER and assert absence of "overlayfs.*snapshotter cannot be
enabled". Modify the block that references GATEWAY_CONTAINER to check
patched_tag (skip when empty), ensure the container is present (fail if
missing), and only then perform the docker logs | grep check to decide
fail/pass.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9c339afb-5879-4ed7-83a2-c2b917d91ee8

📥 Commits

Reviewing files that changed from the base of the PR and between 6df1dea and 42a7714.

📒 Files selected for processing (2)
  • .github/workflows/nightly-e2e.yaml
  • test/e2e/test-overlayfs-autofix.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/nightly-e2e.yaml

Comment thread test/e2e/test-overlayfs-autofix.sh Outdated
Three actionable findings:

1. cluster-image-patch.ts:119 — `FROM ubuntu:24.04` was a floating tag.
   The patched-image tag is a SHA over the Dockerfile *text*, so a
   floating base would let the same NemoClaw tag produce different
   image bytes across security updates. Pin to the current digest
   (sha256:c4a8d550…1c7b) via a new exported UBUNTU_BUILDER_DIGEST
   constant. Refresh procedure documented in the constant's docstring.
   New unit test asserts the FROM line matches
   `ubuntu:24.04@sha256:[0-9a-f]{64} AS bin-fetcher`.

2. cluster-image-patch.ts:200 — `docker image inspect` cache probe ran
   without a timeout, so a stuck Docker daemon could hang onboard
   before the pull/build timeouts ever applied. Add a third timeout
   (DEFAULT_INSPECT_TIMEOUT_MS = 30s, override via
   `inspectTimeoutMs`) and thread it into `imageExists`.

3. test-overlayfs-autofix.sh:98 — `/tmp/nemoclaw-e2e-daemon.json.bak`
   and the absent-marker path were predictable, so on a shared runner
   a pre-created symlink could redirect the subsequent `sudo cp` into
   an attacker-chosen path. Allocate a per-run STATE_DIR via
   `mktemp -d` (mode 0700) and put both files inside it. The EXIT
   trap wipes the dir on shutdown.

Verified locally: the new Dockerfile (with digest-pinned ubuntu
builder) still builds and produces a working `fuse-overlayfs --version`
binary in the final image.

Refs #2481.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
test/e2e/test-overlayfs-autofix.sh (1)

107-109: Consider using mktemp for transient log files as well.

INSTALL_LOG, ONBOARD_LOG_POSITIVE, and ONBOARD_LOG_NEGATIVE use predictable /tmp paths. While these are written by the current user (not via sudo), using unique paths would align with the security rationale applied to STATE_DIR.

Proposed fix
-INSTALL_LOG="${NEMOCLAW_E2E_INSTALL_LOG:-/tmp/nemoclaw-e2e-install.log}"
-ONBOARD_LOG_POSITIVE="/tmp/nemoclaw-e2e-onboard-positive.log"
-ONBOARD_LOG_NEGATIVE="/tmp/nemoclaw-e2e-onboard-negative.log"
+INSTALL_LOG="${NEMOCLAW_E2E_INSTALL_LOG:-${STATE_DIR}/install.log}"
+ONBOARD_LOG_POSITIVE="${STATE_DIR}/onboard-positive.log"
+ONBOARD_LOG_NEGATIVE="${STATE_DIR}/onboard-negative.log"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-overlayfs-autofix.sh` around lines 107 - 109, The three
transient log path variables INSTALL_LOG, ONBOARD_LOG_POSITIVE, and
ONBOARD_LOG_NEGATIVE currently use predictable /tmp filenames; change their
assignments to generate unique temporary files using mktemp (e.g., call mktemp
with a template or mktemp -t) so each variable is set to a securely created
unique path before use, and ensure any created temp files are cleaned up later;
update references to INSTALL_LOG, ONBOARD_LOG_POSITIVE, and ONBOARD_LOG_NEGATIVE
accordingly.
src/lib/cluster-image-patch.ts (1)

294-309: Consider using dynamic import() instead of require().

The require() call works but is a CommonJS pattern. If the codebase is moving toward ES modules, consider switching to dynamic import() for consistency. However, this is a minor style preference and doesn't affect functionality.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/cluster-image-patch.ts` around lines 294 - 309, Replace the CommonJS
require usage with a dynamic ES module import so the module loading matches ESM
style: change the top-level loader for the runner module (currently "const
runner: typeof import(\"./runner\") = require(\"./runner\");") to use dynamic
import and await it where this file initializes, then update uses in
defaultRunCapture and defaultRun to reference the imported runner; ensure the
import resolves to the same shape (runner.runCapture and runner.run) and handle
async initialization (e.g., initialize a cached promise or async initializer) so
defaultRunCapture and defaultRun still work when called.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/lib/cluster-image-patch.ts`:
- Around line 294-309: Replace the CommonJS require usage with a dynamic ES
module import so the module loading matches ESM style: change the top-level
loader for the runner module (currently "const runner: typeof
import(\"./runner\") = require(\"./runner\");") to use dynamic import and await
it where this file initializes, then update uses in defaultRunCapture and
defaultRun to reference the imported runner; ensure the import resolves to the
same shape (runner.runCapture and runner.run) and handle async initialization
(e.g., initialize a cached promise or async initializer) so defaultRunCapture
and defaultRun still work when called.

In `@test/e2e/test-overlayfs-autofix.sh`:
- Around line 107-109: The three transient log path variables INSTALL_LOG,
ONBOARD_LOG_POSITIVE, and ONBOARD_LOG_NEGATIVE currently use predictable /tmp
filenames; change their assignments to generate unique temporary files using
mktemp (e.g., call mktemp with a template or mktemp -t) so each variable is set
to a securely created unique path before use, and ensure any created temp files
are cleaned up later; update references to INSTALL_LOG, ONBOARD_LOG_POSITIVE,
and ONBOARD_LOG_NEGATIVE accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4ffd18b7-4a67-4463-900f-7911a89a5e85

📥 Commits

Reviewing files that changed from the base of the PR and between 42a7714 and c90bd26.

📒 Files selected for processing (3)
  • src/lib/cluster-image-patch.test.ts
  • src/lib/cluster-image-patch.ts
  • test/e2e/test-overlayfs-autofix.sh

Eight findings from a fresh-context adversarial review. Resolutions:

1. Memoize applyOverlayfsAutoFix per upstream image. recoverGatewayRuntime's
   second call to getGatewayStartEnv now returns the cached patched-tag
   without re-running assessHost or re-attempting the build. (The
   reviewer's "45-minute retry storm" was overstated — pRetry captures
   gatewayEnv from outer scope, so the build attempt is once per
   startGatewayWithOptions, not once per retry — but the recovery-path
   redundancy is real and worth deduping.)

2. Bind the patched-image cache key to the upstream image's content
   digest. computePatchedTag now SHAs over (upstreamImage, upstreamDigest,
   snapshotter, dockerfile). ensurePatchedClusterImage resolves the
   digest via `docker image inspect <upstream>` (zero network cost when
   warm; air-gap-safe with pre-staged images). If the local upstream
   isn't there, a `docker manifest inspect` reachability probe (bounded
   by inspectTimeoutMs, default 30s) runs BEFORE the long pull, so
   air-gapped/restricted hosts fail in seconds with a documented error
   instead of hanging through a 10-minute pull timeout. New unit test:
   "differs when only the upstream digest changes". New unit test:
   "fails fast with a documented error when upstream is unreachable on a
   cache miss".

3. Same `docker manifest inspect` probe doubles as the air-gap UX fix
   from finding #3 — fast failure mode + actionable error message that
   points at the troubleshooting doc.

4. Exclude WSL2 hosts from hasNestedOverlayConflict. We don't have a
   confirmed reproducer there, and the WSL kernel's overlay story is
   different from bare Linux. Conservative: leave WSL on the upstream
   image. New unit test: "does not flag a WSL2 Linux host as a conflict".

5. applyOverlayfsAutoFix now logs a console.warn breadcrumb when
   assessHost throws, instead of silently returning null. Future
   regressions in host assessment won't make the auto-fix mysteriously
   stop firing without any user-visible signal.

6. Tighten the e2e negative phase. Was: "any non-zero install.sh exit"
   passes (SKIP on the canonical-error-string check). Now: requires at
   least one of three nested-overlay-failure signatures in the cluster
   log or the install log:
     - "overlayfs snapshotter cannot be enabled" (k3s init)
     - "CreateDiff: Canceled" (sandbox image build)
     - "failed to mount overlay" (catch-all)
   Otherwise FAIL. Distinguishes a real reproduction from unrelated
   flakes (NVIDIA_API_KEY rejection, GHCR rate-limit, daemon blip).

7. Docs note that switching the host's storage driver via daemon.json
   doesn't just kill running containers — it also rebuilds the entire
   local image graph, so previously-pulled images become unusable until
   re-pulled. Documented under the manual workaround.

8. parseDockerStorageDriver now falls back to the plain-text
   `Storage Driver: <name>` form. assessHost still passes
   `--format '{{json .}}'` (the canonical path), but a future caller
   injecting raw `docker info` output won't silently miss the conflict.
   New unit test for the plain-text fixture.

Local sanity-build of the patched Dockerfile (with `ubuntu:24.04` as the
UPSTREAM stand-in) still produces a working `fuse-overlayfs --version`
binary in the final image.

Refs #2481.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa self-assigned this Apr 26, 2026
@ericksoa ericksoa added bug Something isn't working priority: high Important issue that should be resolved in the next release K3s labels Apr 26, 2026
The previous adversarial-review pass added a `!isWsl` gate to
hasNestedOverlayConflict, which is correct in production but
exposed a hidden assumption in two existing test fixtures:

  - "flags Docker 26+ containerd-snapshotter overlayfs as a nested
     overlay conflict"
  - "does not flag the legacy overlay2 driver as a conflict"

Both passed `platform: "linux"` and `env: {}` but left `release`
and `procVersion` to the real host. Most CI runners report
`6.x-generic` and `Linux version 6.x ...` for those, which doesn't
match the `/microsoft/i` test inside detectWsl, so isWsl was false
and the assertions held. The wsl-e2e CI job, however, runs on real
WSL where `os.release()` returns `5.15.x-microsoft-standard-WSL2`
and /proc/version contains "microsoft" — so isWsl flipped true, the
new gate masked hasNestedOverlayConflict, and the conflict-positive
assertion failed for the first time on the previous PR push.

Pin release and procVersion to non-WSL strings in both fixtures so
the assertions hold regardless of which host runs vitest. Add an
explicit `expect(result.isWsl).toBe(false)` in the positive case
so the hermeticity is asserted, not just relied on.

Refs #2481.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Latest nightly (24968749894) failed Phase 5 because this particular
GitHub-Actions ubuntu-latest runner instance did NOT reproduce the
nested-overlay bug under the upstream image. Without the auto-fix:

  Built image openshell/sandbox-from:1777243526
  Uploading image into OpenShell gateway...
  [progress] Exported 769 MiB
  Image openshell/sandbox-from:1777243526 is available in the gateway.
  Waiting for sandbox to become ready...
  Sandbox reported Ready before create stream exited; continuing.
  Waiting for NemoClaw dashboard to become ready...
  [our 300s timeout fires here]

No `CreateDiff: Canceled`, no canonical k3s string, no
`failed to mount overlay`. Just install.sh making slow progress
through what would normally be a successful run, killed by our
timeout while waiting for dashboard health. Compare to the previous
passing run (24966948867) where the same upstream image hit
`CreateDiff: Canceled` during sandbox build on a different runner
instance with a slightly different kernel/Docker patch level.

The previous tightening (e0d8df1) treated "exit non-zero, no
signature" as a flake — too strict. It conflates two distinct
states: (a) runner reproduces the bug at sandbox build time, but
the signature wasn't captured (real bug, would FAIL test); and (b)
runner does not reproduce the bug at all, install.sh just runs out
the timeout (legitimate runner-instance variance, should SKIP).

Distinguish using the `timeout` exit code:
  - exit == 124 (our wrapper fired)      → inconclusive
  - exit != 0 and != 124 (install died)  → genuine error

Three-way result:
  - signature present                    → PASS  (confirmed reproduction)
  - signature absent + exit == 124       → SKIP  (runner non-repro;
                                                  auto-fix correctness
                                                  still validated by
                                                  phases 3 and 4)
  - signature absent + exit != 124       → FAIL  (unrelated flake or
                                                  new failure mode)

Refs #2481.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa merged commit 1f615e2 into main Apr 26, 2026
34 of 39 checks passed
@miyoungc miyoungc mentioned this pull request Apr 28, 2026
13 tasks
miyoungc added a commit that referenced this pull request Apr 28, 2026
## Summary
Refreshes user-facing docs for the last 24 hours of merged NemoClaw
history and bumps the docs metadata to 0.0.29, the next version after
v0.0.28. The updates are limited to behavior supported by merged PR
descriptions and diffs.

## Changes
- `docs/reference/commands.md`: documented `nemoclaw <name> policy-add
--from-file` and `--from-dir`, including custom preset review guidance,
from #2077 / commit `7720b175`.
- `docs/deployment/deploy-to-remote-gpu.md`: clarified that non-loopback
`CHAT_UI_URL` disables OpenClaw device pairing for remote browser-only
deployments, from #2449 / commit `f5ee8a4d`.
- `docs/inference/inference-options.md`: documented provider-aware
credential retry validation and the NVIDIA-only `nvapi-` prefix check,
from #2389 / commit `6f7f0c6d`.
- `docs/inference/switch-inference-providers.md`: documented
`NEMOCLAW_INFERENCE_INPUTS` for text/image-capable model metadata baked
into `openclaw.json`, from #2441 / commit `f4391892`.
- `docs/reference/troubleshooting.md`: added the Git certificate
verification entry for proxy CA propagation through `GIT_SSL_CAINFO`,
`GIT_SSL_CAPATH`, `CURL_CA_BUNDLE`, and `REQUESTS_CA_BUNDLE`, from #2345
/ commit `fa0dc1ab`.
- `docs/versions1.json` and `docs/project.json`: promoted docs version
`0.0.29`; `docs/versions1.json` omits unpublished `0.0.26`, `0.0.27`,
and `0.0.28` entries.
- `.agents/skills/nemoclaw-user-*`: regenerated derived user skill
references from the updated docs.
- Reviewed with no extra doc changes: #2575 / `d392ec07`, #2565 /
`a3231049`, #1965 / `db1ef3ca`, #1990 / `db665834`, #2495 / `7da86fa3`,
#2496 / `3192f4f4`, #2490 / `8c209058`, #2487 / `1f615e2f`, #2483 /
`5653d33a`, #2482 / `31c782c0`, #2464 / `23bb5703`, #2472 / `a54f9a34`,
and #2437 / `6bc860d7`.
- Skipped per docs policy: #2420 / `7b76df6b` touched the experimental
sandbox config path listed in `docs/.docs-skip`; #2466 / `cc15689c`
touched a skipped term and CI-only sandbox image files.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
<!-- Check each item you ran and confirmed. Leave unchecked items you
skipped. -->
- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes — failed locally in installer-integration tests
and one onboard helper timeout; the doc-scoped hook test projects passed
under `prek`.
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only) — build
succeeded, but local Sphinx emitted the existing version-switcher file
read message.
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
<!-- If an AI agent authored or co-authored this PR, check the box and
name the tool. Remove this section for fully human-authored PRs. -->
- [x] AI-assisted — tool: Codex

---
<!-- DCO sign-off required by CI. Run: git config user.name && git
config user.email -->
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Support for custom YAML presets in policy configuration via
--from-file and --from-dir.
* New build-time inference input option to declare accepted modalities
(text or text,image).

* **Improvements**
* Credential validation now offers interactive recovery: re-enter key,
retry, choose another provider, or exit.
* Clarified provider-specific API key prefix handling (nvapi- only
applies to NVIDIA keys).

* **Documentation**
  * TLS certificate troubleshooting for inspected networks.
* Clarified remote dashboard security/device-pairing behavior; command
docs updated; docs version bumped.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working K3s priority: high Important issue that should be resolved in the next release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gateway start fails on OrbStack (macOS) — K3s containerd cannot mount nested overlayfs

2 participants