Skip to content

fix(cli): resolve sandbox env for global status service lookup#4756

Merged
cv merged 8 commits into
NVIDIA:mainfrom
Thabhelo:fix/status-resolve-sandbox-env-1077
Jun 5, 2026
Merged

fix(cli): resolve sandbox env for global status service lookup#4756
cv merged 8 commits into
NVIDIA:mainfrom
Thabhelo:fix/status-resolve-sandbox-env-1077

Conversation

@Thabhelo
Copy link
Copy Markdown
Contributor

@Thabhelo Thabhelo commented Jun 4, 2026

Summary

Global nemoclaw status and status --json now resolve the target sandbox the same way as nemoclaw start / stop, so host service PID lookup honors SANDBOX_NAME, NEMOCLAW_SANDBOX, and NEMOCLAW_SANDBOX_NAME when set.

Fixes #1077

Changes

  • Use resolveDefaultSandboxName in showStatusCommand and getStatusReport
  • Unit tests for env wiring; integration tests in services.test.ts use real PID dirs under /tmp/nemoclaw-services-*

Test plan

  • npm test -- src/lib/inventory/index.test.ts
  • npm test -- src/lib/tunnel/services.test.ts -t '#1077'
  • Manual DGX/WSL steps: comment

Signed-off-by: Thabhelo 50872400+Thabhelo@users.noreply.github.com

Summary by CodeRabbit

  • Bug Fixes

    • Improved accuracy of default sandbox identification in status reports.
    • Enhanced service status detection for better reliability.
  • Tests

    • Added comprehensive test coverage for sandbox-name resolution in status workflows and service status reporting.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR integrates environment-variable based default sandbox resolution into inventory status flows. It adds resolveDefaultSandboxName import usage to compute the effective default sandbox from environment variables, propagates the resolved default through status reporting and service monitoring, and validates the behavior with comprehensive tests covering both inventory and tunnel PID-directory scenarios.

Changes

Default Sandbox Resolution Integration

Layer / File(s) Summary
Add resolveDefaultSandboxName imports
src/lib/inventory/index.ts, src/lib/tunnel/services.test.ts
Import resolveDefaultSandboxName to enable environment-driven default sandbox resolution in both inventory status flows and tunnel service tests.
Refactor inventory status flows to use resolved defaults
src/lib/inventory/index.ts
Compute resolvedDefault via resolveDefaultSandboxName in getStatusReport and showStatusCommand; propagate the resolved default through service status selection (showServiceStatus, getServiceStatuses), messaging-bridge health checks, and gateway log retrieval (readGatewayLog).
Inventory sandbox resolution test coverage
src/lib/inventory/index.test.ts
Add Vitest cases validating that showStatusCommand and getStatusReport reuse listSandboxes snapshots when resolving defaults, and that environment variables SANDBOX_NAME and NEMOCLAW_SANDBOX_NAME drive the resolved default for status selection and report content, with proper env cleanup.
Tunnel service PID-directory matching tests (#1077)
src/lib/tunnel/services.test.ts
Add integration helpers defining sandbox-specific PID directories and a #1077 test suite that seeds cloudflared.pid files, snapshots/restores environment variables, and asserts getServiceStatuses and showStatus correctly report running/stopped state when the resolved sandbox name matches the active PID directory.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

NemoClaw CLI, fix, area: cli, Sandbox

Suggested reviewers

  • cv
  • prekshivyas

🐰 A sandbox named, a service found,
When env vars guide the status round,
No more lost rabbits in the dark—
Default resolved, a clarifying spark! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: resolving sandbox environment variables for global status service lookup, which directly addresses the linked issue.
Linked Issues check ✅ Passed The PR implementation fully addresses issue #1077 by ensuring nemoclaw status resolves sandbox names using environment variables (SANDBOX_NAME, NEMOCLAW_SANDBOX_NAME) consistently with start/stop commands.
Out of Scope Changes check ✅ Passed All changes are scoped to status sandbox resolution: adding resolveDefaultSandboxName usage in status commands, updating test coverage, and adding integration tests for PID directory validation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inventory/index.test.ts`:
- Around line 546-565: The test for SANDBOX_NAME should also isolate
higher-precedence environment variables NEMOCLAW_SANDBOX_NAME and
NEMOCLAW_SANDBOX to avoid flakiness: in the test wrapping the call to
showStatusCommand (the "resolves service status sandbox from SANDBOX_NAME env"
case), save current values of process.env.NEMOCLAW_SANDBOX_NAME and
process.env.NEMOCLAW_SANDBOX before setting process.env.SANDBOX_NAME, then clear
or delete those NemoClaw variables for the duration of the test, call
showStatusCommand as before, and finally restore the original
NEMOCLAW_SANDBOX_NAME and NEMOCLAW_SANDBOX values in the same finally block
along with SANDBOX_NAME to ensure the environment is fully restored.

In `@src/lib/inventory/index.ts`:
- Around line 403-404: The code computes resolvedDefault via
resolveDefaultSandboxName(deps.listSandboxes) but later uses raw defaultSandbox
for messaging health/logs, causing inconsistent sandbox selection; update all
code paths in showStatusCommand and the related health/logs branches (where
defaultSandbox is used — e.g., the blocks handling messaging health/logs and the
sandboxes length branch) to use resolvedDefault instead of defaultSandbox so the
same resolved sandbox is used for both table/service output and health/logs;
ensure any variable passed into functions that fetch health or logs (messaging
health, messaging logs) receives resolvedDefault and keep
resolveDefaultSandboxName(deps.listSandboxes) as the single source of truth.
- Around line 365-367: The destructured variable defaultSandbox in
getStatusReport is unused; update the destructuring of deps.listSandboxes() to
either remove defaultSandbox or rename it to _defaultSandbox to satisfy lint
rules, ensuring resolvedDefault continues to use
resolveDefaultSandboxName(deps.listSandboxes) and liveInference calculation
(sandboxes/getLiveInference) remains unchanged; adjust the const line that
currently declares { sandboxes, defaultSandbox } accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e8a8385b-b28a-457b-9889-3b71eb04cbd2

📥 Commits

Reviewing files that changed from the base of the PR and between 17734b1 and 7e06f24.

📒 Files selected for processing (2)
  • src/lib/inventory/index.test.ts
  • src/lib/inventory/index.ts

Comment thread src/lib/inventory/index.test.ts
Comment thread src/lib/inventory/index.ts Outdated
Comment thread src/lib/inventory/index.ts Outdated
@Thabhelo
Copy link
Copy Markdown
Contributor Author

Thabhelo commented Jun 4, 2026

Manual repro (DGX / WSL — matches #1077 reports)

  1. Onboard or use an existing sandbox (e.g. research); confirm nemoclaw list shows it as default (*).
  2. export TELEGRAM_BOT_TOKEN=... if needed, then nemoclaw start (or nemoclaw tunnel start on current CLI).
  3. Without env: unset SANDBOX_NAME NEMOCLAW_SANDBOX_NAME; nemoclaw status — host cloudflared line should reflect the registry default sandbox PID dir (/tmp/nemoclaw-services-<default>*).
  4. With env (issue workaround): export SANDBOX_NAME=<your-sandbox>; nemoclaw status — should match step 3 when names align; fixes the case where env pointed at a different dir than status used to query.
  5. dknos .nemoclaw_env pattern: set -a; source "$HOME/.nemoclaw_env"; set +a; nemoclaw status — must be sourced in the same shell; the CLI does not auto-load that file.

* On builds after host telegram-bridge removal (PR #1081), status host lines are cloudflared-only; the PID-dir bug class is the same.

Out of scope for this PR: early SANDBOX_NAME validation in bridge startup (dknos), and cases where export is set but status still shows stopped (raymondlo84 — needs separate triage).

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/tunnel/services.test.ts`:
- Around line 149-159: The tests fail when an external NEMOCLAW_SANDBOX env var
interferes; update the suite to snapshot and restore the actual NEMOCLAW_SANDBOX
env var (not just NEMOCLAW_SANDBOX_NAME) so resolveDefaultSandboxName() uses the
expected env during tests: create a savedNemoclawSandbox (or extend
savedNemoclawSandboxName) variable at top, clear process.env.NEMOCLAW_SANDBOX in
the test setup/teardown and restore it in afterEach (mirror how savedSandboxName
/ process.env.SANDBOX_NAME are handled), and apply the same change in the other
affected blocks (the other afterEach/beforeEach locations referenced) so tests
consistently control NEMOCLAW_SANDBOX.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6a001269-81c0-45dd-becc-10c5739d3b8c

📥 Commits

Reviewing files that changed from the base of the PR and between 7e06f24 and 9ef88e6.

📒 Files selected for processing (1)
  • src/lib/tunnel/services.test.ts

Comment thread src/lib/tunnel/services.test.ts
Thabhelo added 3 commits June 4, 2026 01:42
Align nemoclaw status with start/stop by using resolveDefaultSandboxName
so host service PID dirs match SANDBOX_NAME and NEMOCLAW_SANDBOX_NAME.

Fixes NVIDIA#1077

Signed-off-by: Thabhelo <50872400+Thabhelo@users.noreply.github.com>
Exercise real getServiceStatuses/showStatus against /tmp/nemoclaw-services-*
dirs to prove status must pass env-resolved sandbox names, not registry only.

Signed-off-by: Thabhelo <50872400+Thabhelo@users.noreply.github.com>
Use resolvedDefault for messaging health/logs, harden env isolation in
tests, and gitignore local contribution-research notes.

Signed-off-by: Thabhelo <50872400+Thabhelo@users.noreply.github.com>
@Thabhelo Thabhelo force-pushed the fix/status-resolve-sandbox-env-1077 branch from 9ef88e6 to 2e45d41 Compare June 4, 2026 05:43
@cv cv added the v0.0.60 Release target label Jun 5, 2026
Fixes strict typecheck failure in CI checks job.

Signed-off-by: Thabhelo <50872400+Thabhelo@users.noreply.github.com>
Comment thread .gitignore Outdated
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression labels Jun 5, 2026
@wscurran
Copy link
Copy Markdown
Contributor

wscurran commented Jun 5, 2026

✨ Thanks for submitting this detailed PR about fixing the sandbox env resolution for global status service lookup, which improves the consistency of the status and health flows. This proposes a bug fix in the CLI that affects the sandbox management and error reporting.


Related open issues:

@prekshivyas prekshivyas self-assigned this Jun 5, 2026
Copy link
Copy Markdown
Contributor

@prekshivyas prekshivyas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. Verified the fix resolves #1077 correctly:

  • Both status paths (getStatusReport, showStatusCommand) now use resolveDefaultSandboxName consistently — including the messaging-bridge-health and gateway-log calls — and it's the same resolver runStartCommand/runStopCommand use (NEMOCLAW_SANDBOX_NAME ?? NEMOCLAW_SANDBOX ?? SANDBOX_NAME, regex-validated, else the registry default). So global status/status --json now honor the env override exactly like start/stop, and the no-env case is unchanged (just hardened with the safe-name check).

A couple of non-blocking notes:

  • The remaining raw-default use is in getSandboxInventory (line ~202), which backs nemoclaw list — a separate command from status. Leaving list on the registry default looks like an intentional scope boundary; flag only so it's a conscious decision. (This is what CodeRabbit's "consistency across status paths" comment is pointing at.)
  • The new tests (index.test.ts:645, services.test.ts:162) don't snapshot/clear the higher-precedence env vars, so they could go flaky on a runner where NEMOCLAW_SANDBOX/NEMOCLAW_SANDBOX_NAME is set. CI is green today; worth a beforeEach/afterEach save-restore to harden them.

Neither blocks merge. Nice, well-scoped fix.

@Thabhelo
Copy link
Copy Markdown
Contributor Author

Thabhelo commented Jun 5, 2026

Thanks @prekshivyas for the review and approval, glad the fix lines up with #1077.

I've noted the two non-blocking items.

Will address after this is merged.

@cv cv merged commit 751459e into NVIDIA:main Jun 5, 2026
18 checks passed
cv pushed a commit that referenced this pull request Jun 5, 2026
## Summary
Follow-up to #4756 / #1077: `nemoclaw list` and inventory JSON now
resolve the default sandbox from `SANDBOX_NAME` /
`NEMOCLAW_SANDBOX_NAME` / `NEMOCLAW_SANDBOX`, matching `status`,
`start`, and `stop`. Also consolidates #1077 test env isolation into
shared `beforeEach`/`afterEach` hooks.

## Related Issue
Related to #1077 (follow-up to merged #4756)

## Changes
- Wire `resolveDefaultSandboxName()` into `getSandboxInventory()` for
`defaultSandbox` and `isDefault` rows
- Add list/inventory tests for env-resolved default sandbox marker (`*`)
- Refactor `#1077` inventory tests into a nested describe with shared
env save/restore
- Add `beforeEach` env clearing in `services.test.ts` `#1077` describe
block

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [x] `npm test` passes (targeted: `src/lib/inventory/index.test.ts`,
`src/lib/tunnel/services.test.ts`, including `-t '#1077'`)
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `npm run docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Thabhelo <50872400+Thabhelo@users.noreply.github.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
* Improved test organization and environment variable cleanup for better
test reliability.

* **Refactor**
  * Enhanced internal sandbox resolution logic for improved consistency.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Thabhelo <50872400+Thabhelo@users.noreply.github.com>
miyoungc added a commit that referenced this pull request Jun 6, 2026
## Summary
- Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the
dev announcement from discussion #4877.
- Fills the source-doc gaps found during release-prep review across
inference, policy tiers, command behavior, security boundaries, Hermes
dashboard/tooling, runtime context, and troubleshooting.
- Refreshes generated agent skills under `.agents/skills/` from the
current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`.

## Source summary
- #4037 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
system-only runtime context that stays out of visible chat.
- #4875 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
try-first sandbox network/filesystem guidance and clearer failure
classification.
- #4788 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents shared OpenClaw
device-approval policy for startup and connect.
- #4768 -> `docs/reference/network-policies.mdx`,
`docs/network-policy/integration-policy-examples.mdx`,
`docs/get-started/quickstart.mdx`,
`docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`:
Documents `weather`, `public-reference`, and Hermes managed-tool gateway
preset behavior.
- #3788 and #4864 -> `docs/reference/network-policies.mdx`,
`docs/reference/commands.mdx`: Documents non-interactive policy-tier
fail-fast behavior and interactive prompt fallback.
- #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware
default sandbox resolution for `list`, `status`, and `tunnel` commands.
- #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel
status` behavior.
- #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy
preset descriptions in `policy-list`.
- #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents
package-managed OpenShell gateway service and Docker-driver
gateway-marker behavior.
- #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent
gateway/dashboard cleanup isolation by sandbox name and port.
- #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU
patch rollback behavior.
- #4610 -> `docs/reference/troubleshooting.mdx`,
`docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission
guidance aligned and removes skipped experimental wording.
- #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling
for custom `onboard --from <Dockerfile>` contexts in generated skills.
- #4870 -> `docs/reference/commands.mdx`,
`docs/manage-sandboxes/runtime-controls.mdx`: Documents
`NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage.
- #4641 -> `docs/inference/inference-options.mdx`,
`docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM
platform-digest pulls and served-model id adoption.
- #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents
stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash
coverage.
- #4852 -> `docs/inference/use-local-inference.mdx`,
`docs/reference/troubleshooting.mdx`: Documents Ollama model fit
filtering, 16K context floor, cold-load retry, and failed-model
exclusion.
- #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents
API-family sync, Hermes `api_mode`, and Bedrock Runtime exception.
- #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents
Nemotron managed-inference native tool-search fallback.
- #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents
interactive multimodal input prompting.
- #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass
normalization in generated troubleshooting coverage.
- #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents
prebuilt Hermes dashboard assets and TUI recovery without runtime
rebuilds.
- #4854 -> `docs/inference/switch-inference-providers.mdx`,
`docs/reference/commands.mdx`: Documents Hermes proxy API-key
placeholder preservation during inference switches.
- #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`.agents/skills/`: Keeps messaging enrollment behavior aligned with
manifest-hook implementation.
- #4771 -> `docs/security/best-practices.mdx`,
`docs/security/credential-storage.mdx`: Documents Hermes
placeholder-only secret boundary for sandbox-visible runtime files.
- #4787 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents expanded memory scanner
examples for OpenAI project keys and Slack app-level tokens.
- #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill
install mirroring into the agent home directory.
- #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep
structure and generated `.agents/skills/` refresh as the template for
this release.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/
--prefix nemoclaw-user --doc-platform fern-mdx --dry-run`
- `npm run docs`
- `git diff --check`
- skip-term scan across `docs/`, `.agents/skills/`, and `skills/`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hook suites, including markdownlint, gitleaks,
env-var docs gate, docs-to-skills verification, and skills YAML tests

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* DeepSeek-V4-Flash now available as default inference model for DGX
Station.
* Hermes dashboard improved with dedicated port and OAuth-authenticated
tool gateway selection.
* Added weather and public-reference policy presets for expanded agent
capabilities.
* Enhanced Ollama model selection with GPU memory filtering and
automatic retry for timeouts.

* **Bug Fixes**
  * Improved policy tier validation to prevent invalid configurations.
* Better sandbox cleanup scoping by port to prevent conflicts across
deployments.
  * Added GPU patch failure recovery with automatic rollback.

* **Documentation**
* Expanded troubleshooting guides for inference, security, and sandbox
lifecycle.
  * Added .dockerignore best practices for custom deployments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression v0.0.60 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telegram-Bridge Fails to start

4 participants