fix(onboard): debounce Docker GPU patch supervisor reconnect Error-phase short-circuit #4668

Merged
cv merged 4 commits into main from fix-gpu-patch-reconnect-debounce on Jun 2, 2026

Conversation

@laitingsheng laitingsheng commented Jun 2, 2026

Summary

When OpenShell's Docker GPU patch recreates the sandbox container with --gpus all, the brief container churn (stop old → rename → run new) leaves the host's sandbox list cache reporting phase Error for a few seconds before the host re-registers the new container. The previous fast-fail short-circuit treats that transient Error as fatal on the very first poll, so a perfectly healthy GPU sandbox dies within ~12 s with the message "OpenShell supervisor did not reconnect to the GPU-enabled container.", even though the new container is running and healthy and the OCSF supervisor has already logged LIFECYCLE:INSTALL OpenShell Sandbox Supervisor success.

This PR debounces the Error-phase short-circuit: require K consecutive Error polls (default 5 ≈ 10 s sustained Error) before fast-failing. A patched container that actually crashes still fast-fails (~10 s instead of the original ~4 s); a transient teardown-Error during recreation no longer aborts the wait.
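As a rough illustration of the debounce described above (not the code from this PR), the consecutive-Error counter can be sketched as below. The names `createErrorDebouncer` and `wouldFastFail` and the `Phase` type are hypothetical and exist only for this example.

```typescript
// Hypothetical sketch; names and the Phase type are illustrative assumptions.
type Phase = "Ready" | "Provisioning" | "Error";

/** Returns an observer that reports true once `threshold` Error polls occur back to back. */
function createErrorDebouncer(threshold: number): (phase: Phase) => boolean {
  let consecutiveErrors = 0;
  return (phase) => {
    // Any non-Error poll resets the streak, so a flapping phase never accumulates.
    consecutiveErrors = phase === "Error" ? consecutiveErrors + 1 : 0;
    return consecutiveErrors >= threshold;
  };
}

/** Replays a sequence of observed phases; true means the wait would fast-fail. */
function wouldFastFail(phases: Phase[], threshold: number): boolean {
  const observe = createErrorDebouncer(threshold);
  return phases.some((phase) => observe(phase));
}

// A transient Error window shorter than K is absorbed.
console.log(wouldFastFail(["Error", "Error", "Provisioning", "Ready"], 5)); // false
// Sustained Error for K polls still fast-fails.
console.log(wouldFastFail(["Error", "Error", "Error", "Error", "Error"], 5)); // true
```

With a threshold of 1 the observer reports true on the first Error poll, which corresponds to the pre-debounce behavior.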

The supervisor-reconnect code path (constants, helpers, the reconnect wait, and its tests) is extracted into a focused docker-gpu-supervisor-reconnect.ts module with a source-of-truth boundary and removal condition documented in the module header.

Related Issue

Fixes #4664

Changes

  • New module src/lib/onboard/docker-gpu-supervisor-reconnect.ts owns the supervisor-reconnect wait, the timeout helper, and the new debounce helper. Header records the source-of-truth boundary (transient Error is an OpenShell sandbox-list cache artifact during recreation) and the removal condition (drop the debounce once OpenShell guarantees sandbox list skips Error during a known recreate, validated by a real-Docker GPU E2E that observes transient Error recovering to Ready).
  • src/lib/onboard/docker-gpu-patch.ts: replace inline reconnect helpers with imports + re-exports from the new module. recreateOpenShellDockerSandboxWithGpu now calls waitForOpenShellSupervisorReconnect directly. Net file delta vs main: −49 lines.
  • New env var NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE (clamped ≥ 1, default 5) tunes the debounce window. DockerGpuSupervisorReconnectDeps.errorPhaseDebouncePolls lets tests inject a small K without touching env.
  • waitForOpenShellSupervisorReconnect tracks consecutive Error-phase polls and short-circuits only after errorPhaseDebouncePolls consecutive Error reads. Counter resets on any non-Error poll so flapping does not accumulate.
  • src/lib/onboard/docker-gpu-patch.test.ts: existing fast-fail test now asserts the explicit K=1 (no-debounce) behavior so the original intent is preserved when an operator opts out of the debounce. Net file delta vs main: +5 lines.
  • New src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts covers the debounce state machine: transient Error window shorter than K → reconnect succeeds; sustained Error for K polls → still fast-fails; flapping phase resets the counter; env override + lower-bound clamp on getDockerGpuSupervisorReconnectErrorDebouncePolls.
  • docs/reference/troubleshooting.mdx and skills/nemoclaw-user-reference/references/troubleshooting.md: short troubleshooting note under "Docker GPU patch failed during sandbox create" describing the default debounce and when to tune the env var.
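For readers tuning NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE, the "default 5, clamped ≥ 1" rule could look roughly like the sketch below. `parseDebouncePolls` is a hypothetical helper, not the PR's getDockerGpuSupervisorReconnectErrorDebouncePolls, and the handling of empty or non-numeric input (fall back to the default) and of fractional input (round down) is an assumption.

```typescript
// Hypothetical sketch of env-value normalization; not the module's real getter.
const DEFAULT_DEBOUNCE_POLLS = 5; // default stated in the PR description
const MIN_DEBOUNCE_POLLS = 1; // lower bound stated in the PR description

/** Normalizes a raw env-var string into a usable debounce poll count. */
function parseDebouncePolls(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_DEBOUNCE_POLLS;
  const parsed = Number(raw);
  // Assumption: non-numeric or non-finite input falls back to the default.
  if (!Number.isFinite(parsed)) return DEFAULT_DEBOUNCE_POLLS;
  // Assumption: fractional values are rounded down before clamping.
  return Math.max(MIN_DEBOUNCE_POLLS, Math.floor(parsed));
}

console.log(parseDebouncePolls(undefined)); // 5
console.log(parseDebouncePolls("0")); // 1
console.log(parseDebouncePolls("8")); // 8
```

In real use the argument would be the value of the environment variable; it is passed in as a plain string here so the sketch stays free of runtime dependencies.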

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes — ran npx vitest run --project cli src/lib/onboard/docker-gpu-patch.test.ts src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts (54/54 pass) and npm run typecheck:cli (clean). Full npx vitest run --project cli shows 5 pre-existing failures on main HEAD in unrelated files (src/lib/cli/command-registry.test.ts, test/cli.test.ts, test/whatsapp-qr-compact.test.ts). Required runtime validation dispatched: e2e-branch-validation:gpu (run 26819864315) and gpu-repo-local-ollama-openclaw (run 26819868701).
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

  • New Features

    • Supervisor reconnect now debounces transient Error-phase detections to reduce false onboarding failures; debounce count and timeout are configurable (with sensible defaults and minimum clamping), and a no-debounce fast-fail behavior can be asserted via configuration.
  • Documentation

    • Added troubleshooting guidance describing reconnect behavior, default debounce, and how to adjust the debounce window.
  • Tests

    • Expanded tests covering debounce behavior, fast-fail vs. absorb scenarios, counter reset, and env-var handling.

…ase short-circuit

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai Bot commented Jun 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f4d272c-b3c1-40ac-b58f-9cef1731e511

📥 Commits

Reviewing files that changed from the base of the PR and between 21013fc and 1b70103.

📒 Files selected for processing (3)
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts
  • src/lib/onboard/docker-gpu-supervisor-reconnect.ts
💤 Files with no reviewable changes (1)
  • src/lib/onboard/docker-gpu-patch.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/onboard/docker-gpu-supervisor-reconnect.ts
  • src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts

📝 Walkthrough

Walkthrough

Adds a bounded supervisor-reconnect wait that inspects OpenShell sandbox phases and requires a configurable number of consecutive terminal Error-phase polls before failing fast. Exposes env/config getters and a deps override, wires the wait into Docker GPU sandbox recreation, adds tests for debounce behavior, and documents the new env toggle.

Changes

Supervisor-reconnect error-phase debounce

| Layer / File(s) | Summary |
| --- | --- |
| Configuration and public API: src/lib/onboard/docker-gpu-supervisor-reconnect.ts, src/lib/onboard/docker-gpu-patch.ts | Adds exported env constants and getters for reconnect timeout and error-phase debounce, re-exports the supervisor-reconnect entrypoint, removes local reconnect timeout constants from docker-gpu-patch, and extends DockerGpuPatchDeps with optional errorPhaseDebouncePolls to forward into the reconnect wait. |
| Supervisor-reconnect implementation & wiring: src/lib/onboard/docker-gpu-supervisor-reconnect.ts, src/lib/onboard/docker-gpu-patch.ts | Implements ANSI-aware parsing of openshell sandbox list, blocking sleep helper, and waitForOpenShellSupervisorReconnect which polls sandbox exec ... -- true until success, deadline, or a configurable number of consecutive detected terminal Error-phase polls; replaces the former local reconnect short-circuit by calling the new wait from sandbox recreation. |
| Tests and troubleshooting docs: src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts, src/lib/onboard/docker-gpu-patch.test.ts, docs/reference/troubleshooting.mdx, skills/nemoclaw-user-reference/references/troubleshooting.md | Adds comprehensive Vitest coverage for transient Error absorption, sustained-Error fast-fail, counter reset on recovery, and env/default/clamping behavior; updates an existing fast-fail test to pass errorPhaseDebouncePolls: 1; documents the new NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE override and its clamping rule. |

Sequence Diagram

sequenceDiagram
  participant Recreate as recreateOpenShellDockerSandboxWithGpu
  participant Wait as waitForOpenShellSupervisorReconnect
  participant Exec as runOpenshell
  participant Capture as runCaptureOpenshell

  Recreate->>Wait: invoke(timeoutSecs, { errorPhaseDebouncePolls })
  loop poll until deadline or success
    Wait->>Exec: sandbox exec ... -- true
    alt Exec fails
      Wait->>Capture: openshell sandbox list
      Capture-->>Wait: sandbox phase (e.g., Error, Provisioning)
      Wait->>Wait: increment/reset consecutive-Error counter
    end
  end
  Wait-->>Recreate: return success|failure
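
The loop in the diagram above can be approximated with injected dependencies, which is also how a test can run it without real wall-clock time. This is a sketch under assumptions: `ReconnectDeps`, `waitForReconnect`, and the result strings are hypothetical, `execProbe` stands in for the `sandbox exec ... -- true` call, and `readPhase` stands in for parsing `openshell sandbox list`.

```typescript
// Hypothetical sketch of the poll loop; all names here are illustrative assumptions.
interface ReconnectDeps {
  execProbe: () => boolean; // true when the supervisor answers the probe
  readPhase: () => string | null; // phase parsed from the sandbox list, if any
  sleep: (ms: number) => void; // blocking sleep, injectable for tests
  now: () => number; // clock in milliseconds, injectable for tests
}

type ReconnectResult = "connected" | "error-phase" | "timeout";

function waitForReconnect(
  deps: ReconnectDeps,
  timeoutMs: number,
  pollIntervalMs: number,
  errorDebouncePolls: number,
): ReconnectResult {
  const deadline = deps.now() + timeoutMs;
  let consecutiveErrors = 0;
  for (;;) {
    if (deps.execProbe()) return "connected";
    // Only consult the phase after a failed probe; reset the streak on any non-Error phase.
    consecutiveErrors = deps.readPhase() === "Error" ? consecutiveErrors + 1 : 0;
    if (consecutiveErrors >= errorDebouncePolls) return "error-phase";
    if (deps.now() >= deadline) return "timeout";
    deps.sleep(pollIntervalMs);
  }
}
```

Because the clock and sleep are parameters, a test can count polls and sleeps exactly instead of waiting on a timer.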

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/NemoClaw#4407: Earlier PR that modified Error-phase short-circuit behavior in the supervisor reconnect flow; this PR introduces debouncing and rewiring.

Suggested labels

onboarding, Docker, Sandbox

Suggested reviewers

  • ericksoa

Poem

🐰 I hopped in to watch the reconnect race,
Tiny Errors now take a gentler pace.
We count a few polls before sounding alarm,
So transient glitches won't do much harm.
A patient rabbit cheers the steady calm.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main change: debouncing the Docker GPU patch supervisor reconnect Error-phase short-circuit to prevent false failures. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #4664: implements debouncing of transient Error-phase polls (K=5 by default), preserves fast-fail for sustained errors, provides env-var tuning, and includes comprehensive tests and documentation. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope: new supervisor-reconnect module, refactored patch logic to use it, environment variable configuration, comprehensive test coverage, and troubleshooting documentation directly addressing the issue. |


github-actions Bot commented Jun 2, 2026

E2E Advisor Recommendation

Required E2E: gpu-e2e
Optional E2E: gpu-double-onboard-e2e, gpu-repo-local-ollama-openclaw

Dispatch hint: gpu-e2e

Auto-dispatched E2E: gpu-e2e via nightly-e2e.yaml at 1b70103e57963b21ea495b6488bba94f7866e654 (nightly run)

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-e2e (high): This PR changes the real Docker GPU patch supervisor reconnect path used by GPU onboarding. The nightly gpu-e2e job runs install/onboard with NEMOCLAW_PROVIDER=ollama on an NVIDIA GPU runner, validates Docker/GPU availability, verifies onboard GPU proofs, and exercises inference through the patched sandbox.

Optional E2E

  • gpu-double-onboard-e2e (high): Useful adjacent confidence because it performs a second GPU/Ollama onboard on a fresh GPU runner and would exercise the same reconnect behavior during re-onboard, but the core changed path is already covered by gpu-e2e.
  • gpu-repo-local-ollama-openclaw (high): Typed scenario coverage for repo checkout + local Ollama OpenClaw on a Docker CDI GPU runner. This is complementary to nightly gpu-e2e and can catch scenario-registry/user-flow drift, but is not the primary merge-blocking check for this patch.

New E2E recommendations

  • docker-gpu-supervisor-reconnect (high): Existing GPU E2Es validate the happy path but do not deterministically force the transient OpenShell sandbox list Error phase that this debounce is designed to absorb. Add a targeted real-Docker/GPU E2E or scenario assertion that recreates the sandbox container, observes transient Error polls, then verifies reconnect and inference succeed before the debounce window expires.
    • Suggested test: docker-gpu-supervisor-reconnect-transient-error-e2e

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: gpu-e2e


github-actions Bot commented Jun 2, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: gpu-repo-local-ollama-openclaw
Optional scenario E2E: None

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • gpu-repo-local-ollama-openclaw: Changes affect the Docker GPU patch supervisor-reconnect path used during GPU sandbox onboarding. This is the only dispatchable scenario routed to a GPU runner with a GPU Docker runtime, so it is required despite using a special runner.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Optional scenario E2E

  • None.

Relevant changed files

  • src/lib/onboard/docker-gpu-patch.ts
  • src/lib/onboard/docker-gpu-supervisor-reconnect.ts


github-actions Bot commented Jun 2, 2026

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Add targeted runtime validation for the Docker/OpenShell reconnect race (src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts:18): The new unit tests cover the debounce state machine with mocked `sandbox exec` and `sandbox list` sequences, but the linked bug is an infrastructure timing race between Docker container recreation, OpenShell supervisor reconnect, and the sandbox-list cache. The changed files do not add or identify runtime/integration coverage proving that a real patched Docker GPU sandbox recovers from a transient Error phase, or that a genuinely crashed patched container still fails fast with diagnostics in the actual runtime path.
    • Recommendation: Add or identify targeted runtime/integration validation that recreates a Docker GPU sandbox and observes both paths: a transient Error phase recovers to Ready, and a truly failed patched container still surfaces Error-phase diagnostics without burning the full reconnect timeout. Do not rely only on mocked CLI output for this sandbox lifecycle path.
    • Evidence: Deterministic test-depth verdict is `runtime_validation_recommended`. The changed tests simulate `Error, Error, Provisioning, Ready`, sustained Error, counter reset, env override, clamp, and non-finite override cases, but no changed file adds runtime/integration validation for Docker/OpenShell supervisor reconnect.

🌱 Nice ideas

  • None.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

…odule + document env

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

github-actions Bot commented Jun 2, 2026

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26819949625
Target ref: c8bc1c44cbbb4c9ce8dca86310e4d375c9b21d7e
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

| Job | Result |
| --- | --- |
| gpu-e2e | ⏭️ skipped |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/docker-gpu-supervisor-reconnect.ts`:
- Around line 111-112: The injected override deps.errorPhaseDebouncePolls must
be clamped to the same minimum as the env-backed path; change the assignment for
errorPhaseDebouncePolls to normalize the injected value (e.g. use Math.max with
minimum 1) so that errorPhaseDebouncePolls = Math.max(1,
deps.errorPhaseDebouncePolls ??
getDockerGpuSupervisorReconnectErrorDebouncePolls()); this ensures both the deps
override and getDockerGpuSupervisorReconnectErrorDebouncePolls() honor the same
minimum contract.
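
One way to read the suggestion above is sketched here. `resolveErrorDebouncePolls` is a hypothetical helper, and `envBacked` stands in for getDockerGpuSupervisorReconnectErrorDebouncePolls. Since `Math.max(1, NaN)` evaluates to `NaN`, the sketch also guards non-finite overrides; falling back to the env-backed value in that case is an assumption, not something the review comment specifies.

```typescript
// Hypothetical sketch; the fallback for non-finite overrides is an assumption.
function resolveErrorDebouncePolls(
  override: number | undefined,
  envBacked: () => number,
): number {
  const candidate = override ?? envBacked();
  // Math.max(1, NaN) is NaN, which would make the threshold unreachable, so guard it.
  if (!Number.isFinite(candidate)) return Math.max(1, envBacked());
  // Same lower bound for the injected override and the env-backed path.
  return Math.max(1, candidate);
}

console.log(resolveErrorDebouncePolls(0, () => 5)); // 1
console.log(resolveErrorDebouncePolls(undefined, () => 5)); // 5
```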

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b87059ca-56cd-4a48-830c-3297761637c2

📥 Commits

Reviewing files that changed from the base of the PR and between 13abdba and c8bc1c4.

📒 Files selected for processing (6)
  • docs/reference/troubleshooting.mdx
  • skills/nemoclaw-user-reference/references/troubleshooting.md
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts
  • src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts
  • src/lib/onboard/docker-gpu-supervisor-reconnect.ts

Comment thread src/lib/onboard/docker-gpu-supervisor-reconnect.ts Outdated
…o minimum 1

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…es + trim EOF

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

github-actions Bot commented Jun 2, 2026

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26821866927
Target ref: 1b70103e57963b21ea495b6488bba94f7866e654
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

| Job | Result |
| --- | --- |
| gpu-e2e | ⏭️ skipped |

@laitingsheng laitingsheng added the v0.0.57 Release target label Jun 2, 2026

@prekshivyas prekshivyas left a comment

APPROVE.

The trailing-edge debounce with reset-on-recovery in waitForOpenShellSupervisorReconnect is the right model for #4664 — a transient sandbox list Error during container re-registration is absorbed, while a genuinely crashed container still fast-fails (~8s) instead of burning the full timeout. Tests inject a mocked sleep and the poll source (no real wall-clock dependence) and cover the window boundary, flapping-reset, clamp, and non-finite-override cases with exact poll/sleep counts. The CodeRabbit override-clamp gap is fixed and regression-tested. CI green on 1b70103, thread resolved in head.

Non-blocking cleanup: TERMINAL_SANDBOX_FAILURE_PHASES and parseSandboxListFailurePhase in the new module duplicate SANDBOX_FAILURE_PHASE_TOKENS / parseSandboxPhaseFromListOutput still in docker-gpu-patch.ts — values match today but can silently drift; consider exporting one canonical set. Doc nit: default K=5 is ~8s of sleeps (4×2s), not 10s.
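
The arithmetic behind the doc nit can be spelled out in a small sketch. `sleepSecondsUntilFastFail` is a hypothetical helper, and the 2 s poll interval is taken from the "4×2s" figure above. It counts only the sleeps between polls, so any time spent inside each poll would come on top.

```typescript
// Hypothetical helper: K consecutive Error polls are separated by K - 1 sleeps.
function sleepSecondsUntilFastFail(debouncePolls: number, pollIntervalSecs: number): number {
  return Math.max(0, debouncePolls - 1) * pollIntervalSecs;
}

console.log(sleepSecondsUntilFastFail(5, 2)); // 8
console.log(sleepSecondsUntilFastFail(1, 2)); // 0
```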

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

@cv cv merged commit a2c020d into main Jun 2, 2026
34 of 35 checks passed
@cv cv deleted the fix-gpu-patch-reconnect-debounce branch June 2, 2026 16:07
@prekshivyas prekshivyas self-assigned this Jun 2, 2026
@wscurran wscurran added bug-fix PR fixes a bug or regression and removed fix labels Jun 3, 2026

Labels

bug-fix PR fixes a bug or regression v0.0.57 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase

4 participants