refactor(cli): group agent dashboard diagnostics and shields modules#3191
Conversation
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughThis PR reorganizes module structure by consolidating agent, dashboard, debug, and shields modules from flat files into organized subdirectories (agent/, dashboard/, diagnostics/, shields/). All import paths, re-export shims, and test imports are updated to reflect the new structure. No functional logic changes occur. ChangesModule Reorganization: Directory Structure Consolidation
🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
… refactor/lib-feature-clusters
prekshivyas
left a comment
There was a problem hiding this comment.
Mechanical move-only refactor implementing step 2 of the #3189 migration sequence. 23 of 44 files are git-detected renames (similarity 87-100%); the other 21 are import-path-only updates in callers and tooling.
Spot-checked the following for non-mechanical drift and found none:
src/lib/agent/onboard.ts: identifier-by-identifier identical to the oldagent-onboard.ts; only changes are sibling/parent relative-path adjustments (./agent-defs→./defs,./runner→../runner,./adapters/docker→../adapters/docker).src/lib/agent/runtime.ts: same pattern, no logic changes.src/lib/shields/index.test.ts(largest delta at 12+/18-): pure path adjustments —vi.mock(\"../../src/lib/runner\")→vi.mock(\"../runner\"), dynamicimport(\"../src/lib/domain/duration.js\")→import(\"../domain/duration.js\"),path.join(..., \"src\", \"lib\", \"shields.ts\")→path.join(..., \"index.ts\"). The net-6is just simplerpath.joincalls now that the test sits next to its source.
Tooling correctly updated alongside the moves:
.coderabbit.yamlglobsrc/lib/shields*.ts→src/lib/shields/**.scripts/check-legacy-migrated-paths.tsREMOVED_SHIM_MOVES entry forbin/lib/debug.jsretargeted tosrc/lib/diagnostics/debug.ts.scripts/ts-migration-assist.tsSPECIAL_REWRITES entries for debug paths now point atsrc/lib/diagnostics/debug-command.bin/lib/agent-{defs,onboard,runtime}.jsre-export shims retargeted atdist/lib/agent/...so external callers via thebin/libentrypoints still work.- All in-tree callers (
src/lib/onboard.ts,src/lib/policies.ts,src/lib/sandbox-config.ts,src/lib/state/sandbox.ts,src/lib/actions/onboard.ts,src/lib/actions/sandbox/rebuild.ts,src/lib/commands/debug.ts,src/lib/verify-deployment.ts,src/lib/sandbox-version.ts, multiple test files) updated symmetrically.
Tests moved alongside their source (test/shields.test.ts → src/lib/shields/index.test.ts, test/shields-audit.test.ts → src/lib/shields/audit.test.ts), consistent with the colocation pattern in the placement map.
CI: pr.yaml rollup checks pass (commit-lint, dco, layer-boundary, check-hash, changes, get-pr-info, block edits to migrated legacy paths and removed shims). 1 prior pr-self-hosted run success on pull-request/3191. CodeRabbit / checks / macos-e2e / wsl-e2e / build-sandbox-images and current self-hosted run still in progress at approval time.
## Summary - Bump the docs release metadata to `0.0.37`. - Document release-prep updates for messaging policy presets, sandbox runtime utilities, and the GPU CDI troubleshooting path. - Refresh generated `nemoclaw-user-*` skills from the updated docs. ## Source summary - #3159 -> `docs/reference/troubleshooting.md`: Documents the GPU CDI preflight warning and remediation for `nvidia.com/gpu=all` gateway start failures. - #2415 -> `docs/reference/network-policies.md`, `docs/manage-sandboxes/messaging-channels.md`, `docs/network-policy/customize-network-policy.md`: Clarifies that Telegram, Discord, and Slack egress comes from opt-in messaging presets, not the baseline policy. - #3091 -> `docs/deployment/sandbox-hardening.md`, `docs/network-policy/customize-network-policy.md`: Documents the retained sandbox utilities `vi`, `jq`, and `dos2unix` while keeping host-side policy files as the durable source of truth. ## Test plan - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user` - `make docs` - `npm run build:cli` - `npm run typecheck:cli` - Commit and pre-push hooks: markdownlint, docs-to-skills verification, gitleaks, commitlint, CLI typecheck ## Skipped - #3193 and #3191 matched `docs/.docs-skip` entries for experimental shields/config paths. - #3200 and #3183 were test-only fixes. - #3189 and #3163 were internal documentation/refactor changes with no public docs impact. Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Clarified which utilities remain in the sandbox runtime for lightweight inspection and cleanup * Noted that messaging endpoints (Discord, Slack, Telegram) are not in the baseline policy and that channel presets are applied during onboarding * Added GPU passthrough troubleshooting for gateway startup * Updated release/version bump and release-prep workflow guidance, including Discord preset description updates <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Cursor <cursoragent@cursor.com>
…#3409) ## Summary - Wraps the terminal `wait "\$GATEWAY_PID"` in `scripts/nemoclaw-start.sh` (both non-root and root/step-down branches) in a respawn loop so unexpected gateway death no longer drops PID 1 and reaps the sandbox container. - Adds a 60s-window respawn-count guard: after 5 respawns in <60s, logs a `CRITICAL` line so a crashing gateway surfaces in `/tmp/gateway.log` rather than being masked. - Preserves existing `cleanup_on_signal` shutdown semantics — clean exit (rc=0) still drops PID 1, SIGTERM/SIGINT still trigger the existing handler. Closes #2757. ## Root cause The bug report blamed `src/lib/agent-runtime.ts` for missing supervisor logic, but that file was moved to `src/lib/agent/runtime.ts` (#3191) and the gateway launch is correct — `nohup ... &` followed by `wait "\$GATEWAY_PID"`. The real cause sits one layer down: `wait` unblocks the moment the gateway dies, PID 1 exits, and Docker reaps the container by design (`scripts/nemoclaw-start.sh` is the entrypoint). NemoClaw also doesn't pass `--restart=` when OpenShell creates the sandbox, so neither layer recovers. ## Verification Reproduced locally in Ubuntu 24.04 via a synthetic entrypoint mirroring lines 2240-2268 of this file: | Test | Result | |---|---| | Without patch, `kill -9 \$GATEWAY_PID` | Container `exited` (exitCode=137, restartCount=0). Matches QA report. | | With patch + same kill | Loop sees rc=137, sleeps 2s, relaunches. Container stays `running`, gateway gets new PID. `nemoclaw status` → healthy. | ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] \`npx prek run --all-files\` passes (shellcheck clean on the touched file) - [ ] \`npm test\` passes (no JS/TS touched; not run) - [ ] Tests added or updated for new or changed behavior — see below - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes (n/a — internal entrypoint behavior) - [ ] \`make docs\` builds without warnings (doc changes only) ## Test plan Manual repro mirrors the QA acceptance criteria: 1. \`nemoclaw onboard --name my-assistant --non-interactive\` 2. \`docker exec <sandbox-container> pgrep -af "openclaw gateway"\` → note PID 3. \`docker exec <sandbox-container> kill -9 <pid>\` 4. Wait 5s 5. \`nemoclaw my-assistant status\` → expect HEALTHY (no \`connect\` needed) I did **not** add an automated E2E test for the kill-and-respawn flow in this PR (scope kept minimal per #2757's acceptance criteria); happy to follow up with one if reviewers want — would slot into \`test/e2e/test-sandbox-survival.sh\`. ## Notes for reviewers - Both branches of the entrypoint (non-root at L2021, root/step-down at L2240) get the same loop. The root branch uses \`"\${STEP_DOWN_PREFIX_GATEWAY[@]}"\` to preserve the gateway-user UID separation on respawn. - \`SANDBOX_WAIT_PID\` is reassigned on each respawn so \`cleanup_on_signal\` (in \`scripts/lib/sandbox-init.sh\`) waits on the live PID during shutdown. - \`SANDBOX_CHILD_PIDS\` accumulates respawn PIDs; the trap kills them all with \`2>/dev/null || true\` so stale entries don't break shutdown. - Tier-3 follow-up (have \`nemoclaw status\` also call \`checkAndRecoverSandboxProcesses\`, currently only \`connect\` does) is logged as a separate quick-win — not in this PR's scope. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Gateway service now auto-restarts if it exits unexpectedly, improving availability and reducing manual intervention. * Added safeguards and enhanced logging to detect and emit a critical alert when frequent restart attempts occur within a short window, preventing runaway restart loops. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3409) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Julie Yaunches <jyaunches@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
Moves the first low-risk feature clusters out of the flat
src/libnamespace. Agent, dashboard, diagnostics, and shields helpers now live in feature folders that match the placement map introduced by the prior PR.Changes
src/lib/agent/**and update bin shims/imports.src/lib/dashboard/**.src/lib/diagnostics/**.src/lib/shields/**.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Carlos Villela cvillela@nvidia.com
Summary by CodeRabbit