Skip to content

fix(onboard,uninstall): replace misleading recovery messages (#3456)#3520

Merged
cv merged 4 commits into
mainfrom
fix/3456-misleading-recovery-messages
May 14, 2026
Merged

fix(onboard,uninstall): replace misleading recovery messages (#3456)#3520
cv merged 4 commits into
mainfrom
fix/3456-misleading-recovery-messages

Conversation

@cjagwani
Copy link
Copy Markdown
Contributor

Draft for visibility. Issue-autopilot Stages 4-5 of #3456. Will mark ready once batch self-review + CI complete.

Summary

Closes the two remaining output threads in #3456 after the core dead-loop fix already landed on main (via #3459, #3434, #3483). Full sub-bug mapping in the #3456 status comment.

Acceptance criteria mapping

Sub-bug Resolution Evidence
#1 dead loop Already fixed on main (#3459) out of scope
#2 firewall diagnostic Already fixed on main (#3459) out of scope
#3 literal <name> placeholder This PR src/lib/onboard/gpu-recovery.ts + onboard.ts:10387-10405
#4 misleading "skipped" wording This PR src/lib/actions/uninstall/run-plan.ts:210-228, 407-414
#5 uninstall residuals Already fixed on main (#3483) out of scope

Behavior matrix

gpuPassthroughRecoveryLines(names):

Input Suggestion
null / [] nemoclaw uninstall && nemoclaw onboard --gpu
one sandbox nemoclaw <name> destroy --yes --cleanup-gateway && nemoclaw onboard --gpu
many sandboxes each destroy --yes, only the last gets --cleanup-gateway

Test plan

npm run typecheck:cli
npx vitest run src/lib/onboard/gpu-recovery.test.ts src/lib/actions/uninstall/run-plan.test.ts

22 tests pass (6 new + 16 existing).

Notes for reviewers

  • This is the work #3464 attempted; that PR was closed without merging after CodeRabbit asked for the <name> placeholder to be forbidden in tests via negative assertion. This PR adopts that refinement.
  • runOptional extension is backwards-compatible — existing callers without onSkip get the original wording.

Closes #3456 once merged.

Resolve the two output threads in #3456 left after the core dead-loop fix
landed via #3459 + #3434:

Sub-bug #3 — `src/lib/onboard.ts` printed
  `nemoclaw <name> destroy --yes && nemoclaw onboard --gpu`
with a literal `<name>` placeholder, and assumed at least one sandbox
was registered. When the GPU-passthrough mismatch hit on the State B
re-run path with an empty registry (the dead-loop case), the hint was
not actionable. Replace with a registry-aware helper at
`src/lib/onboard/gpu-recovery.ts` that renders the right shape:
  - empty registry → suggest `nemoclaw uninstall && nemoclaw onboard --gpu`
  - one sandbox → suggest destroy --yes --cleanup-gateway for that name
  - multiple sandboxes → list each, only the last gets --cleanup-gateway

Sub-bug #4 — `src/lib/actions/uninstall/run-plan.ts` printed
  `Destroyed gateway 'nemoclaw' skipped`
when the openshell destroy no-op'd (gateway already gone) — the
"Destroyed … skipped" wording was self-contradictory. Extend
`runOptional` with an `onSkip` option; route the gateway destroy to
emit `Gateway 'nemoclaw' already removed or unreachable` on no-op.

Tests:
- `src/lib/onboard/gpu-recovery.test.ts` (6 tests): forbid literal
  `<name>` placeholder anywhere in the output; cover empty / single /
  multi-sandbox cases; defensive filter on whitespace names so a
  `nemoclaw  destroy` rendering can never happen.
- `src/lib/actions/uninstall/run-plan.test.ts`: assert the new
  "already removed or unreachable" wording and the absence of the
  "Destroyed gateway 'nemoclaw' skipped" string.

The core dead loop itself (sub-bugs #1, #2 and State B GPU mismatch)
is already addressed by #3459 + #3434 + #3483; #3456 will close once
this lands. See the #3456 status comment for the full mapping.

Refs #3456. Mirrors (and tightens) the approach in the closed PR #3464,
which left the literal `<name>` placeholder in tests per CodeRabbit
feedback that was never addressed.

Signed-off-by: Charan Jagwani <charjags100@gmail.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 14, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

Caution

Review failed

The head commit changed during the review from b592bda to 64c1ac6.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3456-misleading-recovery-messages

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

E2E Advisor Recommendation

Required E2E: gpu-e2e, deployment-services-e2e
Optional E2E: gpu-double-onboard-e2e, sandbox-operations-e2e

Dispatch hint: gpu-e2e,deployment-services-e2e

Auto-dispatched E2E: gpu-e2e, deployment-services-e2e via nightly-e2e.yaml at 64c1ac6840f34c032ca1974023cfdabafe2b77b2

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-e2e (high; requires linux-amd64-gpu-rtxpro6000-latest-1 GPU runner and Ollama/model setup): Validates the real GPU/Ollama onboard path on a GPU runner, including gateway creation with GPU passthrough and end-to-end sandbox inference. This is the closest existing E2E coverage for the onboard GPU path modified by the new recovery helper.
  • deployment-services-e2e (medium-high; live cloud onboard with NVIDIA_API_KEY and destructive uninstall at the end): Exercises destructive lifecycle cleanup including nemoclaw uninstall --keep-openshell --yes, which is the existing E2E path covering uninstall behavior touched by the gateway destroy skip-message change.

Optional E2E

  • gpu-double-onboard-e2e (high; requires GPU runner and repeated Ollama onboard): Useful adjacent confidence for re-onboard/reuse behavior with Ollama on a GPU host. It does not appear to force the exact missing-GPU-passthrough recovery branch, but it stresses repeated onboard with an existing GPU gateway and auth proxy.
  • sandbox-operations-e2e (high; creates and manipulates live sandboxes/gateway): Adjacent lifecycle confidence for destroy, registry, gateway recovery, and multi-sandbox operations. Useful because the new GPU recovery hint depends on registered sandbox names, but not directly merge-blocking for this narrow message/helper change.

New E2E recommendations

  • onboarding-gpu-gateway-recovery (high): Existing GPU E2Es validate successful GPU onboard/re-onboard, but they do not intentionally create a reusable OpenShell gateway without GPU passthrough and then run nemoclaw onboard --gpu to verify the actionable recovery output for empty, single, and multiple sandbox registries.
    • Suggested test: Add a targeted GPU recovery E2E scenario that creates or simulates a non-GPU reusable gateway, invokes onboard with GPU requested, and asserts the recovery hint contains nemoclaw uninstall for an empty registry or the correct destroy --yes --cleanup-gateway commands for registered sandboxes without rendering <name>.
  • uninstall-and-openshell-gateway-cleanup (medium): The unit test covers the new wording when openshell gateway destroy no-ops, but existing uninstall E2E coverage does not appear to assert the already-removed/unreachable gateway message or absence of the contradictory Destroyed gateway ... skipped text.
    • Suggested test: Extend the uninstall/deployment-services E2E or add a focused uninstall E2E case that removes the gateway before running nemoclaw uninstall --yes, then asserts the uninstall log reports Gateway 'nemoclaw' already removed or unreachable and never prints Destroyed gateway 'nemoclaw' skipped.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: gpu-e2e,deployment-services-e2e

@wscurran wscurran added the v0.0.42 Release target label May 14, 2026
…nto gpu-recovery (#3456)

The first commit grew `src/lib/onboard.ts` by +13 lines, which trips the
`onboard-entrypoint-budget` policy on `main` (entrypoint must be
net-neutral or smaller; new logic belongs under `src/lib/onboard/**`).

Extract the inline registry-lookup + emit loop into two new exports in
`src/lib/onboard/gpu-recovery.ts`:

- `getRegisteredSandboxNamesForGpuRecovery()` — registry read with a
  graceful empty-list fallback.
- `reportGpuPassthroughRecovery(emit, loadNames?)` — emits the hint
  lines via `emit`. `loadNames` defaults to the registry reader; tests
  inject their own list.

The onboard.ts callsite collapses from 14 lines to 1:
  `reportGpuPassthroughRecovery(console.error);`

Net change in `src/lib/onboard.ts`: -1 line. Budget passes.

Adds two tests for the new wrapper covering empty-registry and
multi-sandbox cases. 24 tests pass (was 22).

Signed-off-by: Charan Jagwani <charjags100@gmail.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

`.agents/skills/nemoclaw-maintainer-issue-autopilot/SKILL.md` was
accidentally included in the previous commit (afe5427). It's a
local-only maintainer skill, not relevant to the #3456 fix. Drop it
from the PR so the Stage 9 perfect-match audit shows no surplus.

Signed-off-by: Charan Jagwani <charjags100@gmail.com>
@cjagwani cjagwani marked this pull request as ready for review May 14, 2026 17:47
@cjagwani cjagwani added bug Something isn't working priority: high Important issue that should be resolved in the next release NV QA Bugs found by the NVIDIA QA Team UAT Issues flagged for User Acceptance Testing. labels May 14, 2026
@cjagwani
Copy link
Copy Markdown
Contributor Author

READY FOR HUMAN REVIEW

Stage 9 perfect-match acceptance audit, with no surplus.

Acceptance clause → evidence

# Clause from #3456 Status Evidence
1 Gateway / host-gateway bind mismatch causing "firewall is blocking" MET (upstream #3459 on main) src/lib/onboard/gateway-sandbox-reachability.ts
2 Hard-coded "host firewall is blocking" diagnostic MET (upstream #3459 on main) Same file — probe_unavailable classification + conditional UFW hint
3 nemoclaw <name> destroy --yes not actionable when no sandbox registered MET (this PR) src/lib/onboard/gpu-recovery.ts + src/lib/onboard.ts:10387-10389
4 Destroyed gateway 'nemoclaw' skipped self-contradictory wording MET (this PR) src/lib/actions/uninstall/run-plan.ts:210-228, 407-414
5 Uninstall residuals (procs, state dir, docker network) MET (upstream #3483 on main) src/lib/domain/uninstall/plan.tsstop-helper-services, stop-openshell-forward-processes, stop-orphaned-openshell-processes, delete-docker-volume, delete-related-docker-containers, delete-related-docker-images
State B GPU mismatch "Existing gateway started without GPU passthrough" recovery flow MET (upstream #3434 on main) NemoClaw Docker GPU patch — sandbox created without OpenShell --gpu, recreated with Docker GPU flags

Surplus scan

No surplus. PR-scoped diff: 5 files, +247/-12, every file traces to clause #3 or #4.

Final gates

  • ✅ All checks green (17/17): commit-lint, dco-check, legacy-paths, onboard-entrypoint-budget, changes, cli-parity, markdown-links, Pi-E2E-advisor, macos-e2e, CodeQL (js-ts/python), check-hash, checks, ShellCheck, test-e2e-ollama-proxy, CodeRabbit (skipped while draft — will re-engage now that the PR is ready).
  • ✅ Every commit signed (%G? = G), author = Charan Jagwani <cjagwani@nvidia.com> mirror via DCO.
  • npm run typecheck:cli clean.
  • ✅ 24 targeted unit tests pass (6 new gpu-recovery.test.ts + 1 new run-plan.test.ts + 17 existing).
  • ✅ Labels applied: bug, priority: high, NV QA, UAT, v0.0.42 (auto-targeted).

RFR draft (cc reviewers): PR closes the two remaining output threads in #3456 — registry-aware GPU-passthrough recovery hint + non-contradictory uninstall destroy wording. Five files, +247/-12. Adopts the CodeRabbit refinement from the closed #3464 (forbid <name> placeholder in tests via negative assertion). The other three sub-bugs in #3456 are already on main via #3459 + #3434 + #3483.

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25875995216
Target ref: 64c1ac6840f34c032ca1974023cfdabafe2b77b2
Workflow ref: main
Requested jobs: gpu-e2e,deployment-services-e2e
Summary: 1 passed, 0 failed, 1 skipped

Job Result
deployment-services-e2e ✅ success
gpu-e2e ⏭️ skipped

@cv cv merged commit e6e0e1d into main May 14, 2026
17 checks passed
cv added a commit that referenced this pull request May 14, 2026
…3536)

## Summary

Adds `~/.local/state/nemoclaw/` to the uninstall plan so `nemoclaw
uninstall` cleans the Linux Docker-driver gateway's state directory (pid
file, SQLite db, audit log, `vm-driver/` state). Closes #3535 — the
verified residual sub-bug #5c from #3456, which auto-closed when #3520
merged.

## Acceptance criteria mapping

| Clause from #3535 | Evidence |
|---|---|
| `~/.local/state/nemoclaw/` is removed by `nemoclaw uninstall` on Linux
| `src/lib/actions/uninstall/run-plan.ts:588` adds
`removePath(paths.gatewayLocalStateDir, runtime)` to the `executePlan`
removal block |
| `paths.test.ts` confirms the path is in `uninstallStatePaths()` return
| `src/lib/domain/uninstall/paths.test.ts` — new test `#3456: exposes
the Linux Docker-driver gateway state dir so uninstall can clean it` |
| No regression in existing uninstall tests | `npx vitest run
src/lib/domain/uninstall/paths.test.ts
src/lib/domain/uninstall/plan.test.ts
src/lib/actions/uninstall/run-plan.test.ts` — 22 pass (1 new + 21
existing) |

## Behavior matrix

| Before | After |
|---|---|
| `uninstallStatePaths()` returns 3 dirs: `~/.nemoclaw`,
`~/.config/openshell`, `~/.config/nemoclaw` | Returns 4: adds
`~/.local/state/nemoclaw` |
| `executePlan` removes 3 of 4 NemoClaw-owned dirs; leaves Linux gateway
state behind | Removes all 4 |
| `defaultUninstallPaths()` does not surface `gatewayLocalStateDir` |
Surfaces it (matches `NEMOCLAW_OPENSHELL_GATEWAY_STATE_DIR` doc) |

## Test plan

```
npm run typecheck:cli
npx vitest run src/lib/domain/uninstall/paths.test.ts src/lib/domain/uninstall/plan.test.ts src/lib/actions/uninstall/run-plan.test.ts
```

Result: 22/22 pass.

## Notes for reviewers

- Path source-of-truth: `docs/reference/commands.md:1150` documents
`NEMOCLAW_OPENSHELL_GATEWAY_STATE_DIR` as
`~/.local/state/nemoclaw/openshell-docker-gateway`. The parent dir
(`~/.local/state/nemoclaw/`) is what we own; cleaning the parent removes
the gateway dir + `vm-driver/` + future siblings.
- The `Pick<>` on `uninstallStatePaths` was extended to include the new
field; existing callers in `executePlan` already pass full
`UninstallPaths` so they pick it up automatically.
- Out of scope: sub-bug #5d (`.openshell-installed-version`). My
follow-up comment on #3456 explains — no source in NemoClaw writes that
sentinel on current `main`, so filing a speculative fix isn't
appropriate yet.

Closes #3535
Refs #3456

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Uninstall now also removes the Linux gateway local state directory,
ensuring more complete cleanup on uninstall.

* **Tests**
* Added a regression test to verify the gateway local state directory is
exposed and included in uninstall cleanup.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3536)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
@miyoungc miyoungc mentioned this pull request May 14, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 14, 2026
## Summary
Refreshes the NemoClaw documentation for the local `main` changes
included in the 0.0.42 release. The update adds release notes, updates
the affected user-facing setup and troubleshooting pages, bumps docs
metadata to 0.0.42, and regenerates the matching user skills.

## Changes
- #3537 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`: Documented host-level status
fields, cloudflared state-specific recovery hints, and Local Ollama auth
proxy status diagnostics.
- #3454 -> `docs/get-started/prerequisites.md`,
`docs/get-started/quickstart.md`: Documented macOS Docker-driver
onboarding and removed the expectation that standard macOS setup needs a
VM driver helper.
- #3514 -> `docs/inference/use-local-inference.md`: Documented
compatible-endpoint retry behavior for reasoning-only smoke responses.
- #3448 -> `docs/reference/commands.md`,
`docs/manage-sandboxes/messaging-channels.md`: Documented canonical
channel names and policy preset hints after `channels add`.
- #3520 -> `docs/about/release-notes.md`: Captured clearer GPU recovery
and uninstall wording in the 0.0.42 release notes.
- #3313 -> `docs/get-started/quickstart.md`,
`docs/reference/troubleshooting.md`: Documented stronger dashboard port
detection and rollback when a forward cannot start.
- #3502 -> `docs/about/release-notes.md`: Captured batched onboarding
policy preset application in the 0.0.42 release notes.
- #3505 -> `docs/reference/troubleshooting.md`: Documented the top-level
Colima socket path.
- #3421 -> `docs/about/release-notes.md`: Captured idempotent installer
shim logging in the 0.0.42 release notes.
- Updated `docs/project.json`, `docs/versions1.json`, and regenerated
`.agents/skills/nemoclaw-user-*` outputs.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [x] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes - v0.0.42

* **Documentation**
  * Enhanced macOS onboarding guidance for Docker gateway setup
  * Improved dashboard port conflict handling with automatic rollback
* Better local Ollama inference diagnostics and authentication proxy
checks
  * Clarified status command output and recovery procedures
  * Refined messaging channel setup documentation

* **Chores**
  * Version bump to 0.0.42

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3540)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working NV QA Bugs found by the NVIDIA QA Team priority: high Important issue that should be resolved in the next release UAT Issues flagged for User Acceptance Testing. v0.0.42 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[All Platforms][Onboard] ./install.sh and nemoclaw onboard enter dead loop — sandbox can't reach gateway, "firewall" hint is misleading

4 participants