refactor(cli): group agent dashboard diagnostics and shields modules#3191

Merged
cv merged 10 commits into main from refactor/lib-feature-clusters on May 7, 2026

@cv cv commented May 7, 2026

Summary

Moves the first low-risk feature clusters out of the flat src/lib namespace. Agent, dashboard, diagnostics, and shields helpers now live in feature folders that match the placement map introduced by the prior PR.

Changes

  • Move agent definition/onboard/runtime helpers under src/lib/agent/** and update bin shims/imports.
  • Move dashboard contract, health, and recovery helpers under src/lib/dashboard/**.
  • Move debug collection and debug command parsing helpers under src/lib/diagnostics/**.
  • Move shields implementation, timer, audit helper, and related tests under src/lib/shields/**.
  • Update imports, source-shape tests, CodeRabbit path instructions, and legacy migration helper mappings for the new paths.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

Summary by CodeRabbit

  • Refactor
    • Reorganized internal module structure to improve code organization and maintainability. Updated import paths across the codebase to reflect new module locations. No changes to user-facing features or functionality.

@cv cv self-assigned this May 7, 2026
coderabbitai Bot commented May 7, 2026

Review Change Stack

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR reorganizes module structure by consolidating agent, dashboard, debug, and shields modules from flat files into organized subdirectories (agent/, dashboard/, diagnostics/, shields/). All import paths, re-export shims, and test imports are updated to reflect the new structure. No functional logic changes occur.

Changes

Module Reorganization: Directory Structure Consolidation

| Layer | File(s) | Summary |
|---|---|---|
| Migration Configuration & Scripts | .coderabbit.yaml, scripts/check-legacy-migrated-paths.ts, scripts/ts-migration-assist.ts | Configuration and scripts updated to define new module paths and migration rules for agent, dashboard, and debug modules. |
| Re-export Shim Updates | bin/lib/agent-defs.js, bin/lib/agent-onboard.js, bin/lib/agent-runtime.js | CommonJS shims updated to target new compiled paths: dist/lib/agent/defs, dist/lib/agent/onboard, dist/lib/agent/runtime. |
| Agent Module Source Files | src/lib/agent/defs.ts, src/lib/agent/onboard.ts, src/lib/agent/runtime.ts | Agent module sources reorganized to use new relative import paths for dependencies within agent/ subdirectory structure. |
| Agent Module Test Updates | src/lib/agent/base-image.test.ts, src/lib/agent/defs.test.ts, src/lib/agent/onboard.test.ts, src/lib/agent/runtime.test.ts | Agent test files updated to import from new dist/lib/agent/ compiled module paths. |
| Agent Module Consumer Imports | src/lib/actions/onboard.ts, src/lib/actions/sandbox/rebuild.ts, src/lib/policies.ts, src/lib/sandbox-version.ts, src/lib/state/sandbox.ts | Source files consuming agent modules updated to import from new agent/defs and agent/onboard paths. |
| Dashboard Module Path Updates | src/lib/dashboard/contract.ts, src/lib/dashboard/health.ts, src/lib/dashboard/recover.ts | Dashboard modules consolidated under dashboard/contract, dashboard/health, dashboard/recover with all internal imports updated. |
| Dashboard Module Test Updates | src/lib/dashboard/contract.test.ts, src/lib/dashboard/health.test.ts, src/lib/dashboard/recover.test.ts | Dashboard test files updated to import from new dist/lib/dashboard/ compiled module paths. |
| Debug/Diagnostics Module Reorganization | src/lib/diagnostics/debug.ts, src/lib/commands/debug.ts | Debug modules relocated to diagnostics/ subdirectory with repoDir traversal adjusted and all imports updated. |
| Debug/Diagnostics Test Updates | src/lib/diagnostics/debug-command.test.ts, src/lib/diagnostics/debug.test.ts, src/lib/commands/simple-global-oclif-adapters.test.ts | Debug test files updated to import from new dist/lib/diagnostics/ compiled module paths. |
| Shields Module Reorganization | src/lib/shields/audit.ts, src/lib/shields/index.ts, src/lib/shields/timer.ts | Shields module split into subdirectory with audit.ts and timer.ts as separate modules; imports updated to parent-relative paths. |
| Shields Module Test Updates | src/lib/shields/index.test.ts | Test mocks updated to resolve new shields module structure with corrected relative import paths and fixture locations. |
| Integration & Dashboard Consumer Imports | src/lib/onboard.ts, src/lib/sandbox-config.ts, src/lib/verify-deployment.ts | Integration files updated to import dashboard and agent modules from new dashboard/contract and agent/defs paths. |
| Comprehensive Test Suite Updates | test/sandbox-version.test.ts, test/verify-deployment.test.ts, test/config-set-nested-ssrf.test.ts, test/onboard.test.ts, test/repro-2681-group-writable.test.ts, test/secret-redaction.test.ts | Test suite updated with new compiled module paths for agent, dashboard, shields, and diagnostics across all integration scenarios. |

🎯 2 (Simple) | ⏱️ ~10 minutes

🐰 A rabbit hops through the code so fine,
Organizing modules in a neat design,
Agent, dashboard, shields all in place,
No logic changed, just structure to embrace!
Imports rewired with careful grace. 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: refactoring to reorganize agent, dashboard, diagnostics, and shields modules into feature-specific folders within src/lib. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@cv cv marked this pull request as draft May 7, 2026 16:55
copy-pr-bot Bot commented May 7, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@cv cv added the v0.0.37 Release target label May 7, 2026
@cv cv changed the base branch from refactor/lib-placement-map to main May 7, 2026 18:47
@cv cv marked this pull request as ready for review May 7, 2026 18:47
@cv cv requested a review from ericksoa May 7, 2026 18:47
@cv cv requested a review from jyaunches May 7, 2026 18:48
@cv cv requested review from cjagwani, jyaunches and prekshivyas May 7, 2026 18:48
@prekshivyas prekshivyas left a comment

Mechanical move-only refactor implementing step 2 of the #3189 migration sequence. 23 of 44 files are git-detected renames (similarity 87-100%); the other 21 are import-path-only updates in callers and tooling.

Spot-checked the following for non-mechanical drift and found none:

  • src/lib/agent/onboard.ts: identifier-by-identifier identical to the old agent-onboard.ts; the only changes are sibling/parent relative-path adjustments (./agent-defs → ./defs, ./runner → ../runner, ./adapters/docker → ../adapters/docker).
  • src/lib/agent/runtime.ts: same pattern, no logic changes.
  • src/lib/shields/index.test.ts (largest delta at 12+/18-) contains only path adjustments: vi.mock("../../src/lib/runner") → vi.mock("../runner"), dynamic import("../src/lib/domain/duration.js") → import("../domain/duration.js"), path.join(..., "src", "lib", "shields.ts") → path.join(..., "index.ts"). The net -6 is just simpler path.join calls now that the test sits next to its source.
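
The specifier changes in the spot-checks above follow mechanically from moving a file one directory deeper. A minimal sketch of that rebasing, assuming POSIX-style repo-relative paths; `MOVED`, `rebaseSpecifier`, and the helper functions are hypothetical names used for illustration and are not code from this PR:

```typescript
// Hypothetical move table: module paths (without extension) that were relocated.
const MOVED: Record<string, string> = {
  "src/lib/agent-defs": "src/lib/agent/defs",
  "src/lib/agent-onboard": "src/lib/agent/onboard",
};

// Split a POSIX-style path into segments, resolving "." and "..".
function normalize(p: string): string[] {
  const out: string[] = [];
  for (const seg of p.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") out.pop();
    else out.push(seg);
  }
  return out;
}

// Directory part of a file path ("src/lib/a.ts" -> "src/lib").
function dirname(file: string): string {
  return file.split("/").slice(0, -1).join("/");
}

// Relative import specifier from a directory to a target module.
function relativeSpecifier(fromDir: string[], to: string[]): string {
  let i = 0;
  while (i < fromDir.length && i < to.length && fromDir[i] === to[i]) i++;
  const ups = fromDir.length - i;
  const rest = to.slice(i).join("/");
  return ups === 0 ? `./${rest}` : `${"../".repeat(ups)}${rest}`;
}

// Recompute a relative specifier after the importing file moves.
function rebaseSpecifier(oldFile: string, newFile: string, specifier: string): string {
  // Resolve what the old specifier pointed at.
  const target = normalize(`${dirname(oldFile)}/${specifier}`).join("/");
  // Follow the target if it was itself relocated.
  const finalTarget = MOVED[target] ?? target;
  return relativeSpecifier(normalize(dirname(newFile)), normalize(finalTarget));
}

const oldFile = "src/lib/agent-onboard.ts";
const newFile = "src/lib/agent/onboard.ts";
console.log(rebaseSpecifier(oldFile, newFile, "./agent-defs"));      // "./defs"
console.log(rebaseSpecifier(oldFile, newFile, "./runner"));          // "../runner"
console.log(rebaseSpecifier(oldFile, newFile, "./adapters/docker")); // "../adapters/docker"
```

The three calls reproduce the adjustments named for src/lib/agent/onboard.ts: a moved sibling becomes `./defs`, while unmoved siblings gain one `../`.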

Tooling correctly updated alongside the moves:

  • .coderabbit.yaml glob src/lib/shields*.ts → src/lib/shields/**.
  • scripts/check-legacy-migrated-paths.ts REMOVED_SHIM_MOVES entry for bin/lib/debug.js retargeted to src/lib/diagnostics/debug.ts.
  • scripts/ts-migration-assist.ts SPECIAL_REWRITES entries for debug paths now point at src/lib/diagnostics/debug-command.
  • bin/lib/agent-{defs,onboard,runtime}.js re-export shims retargeted at dist/lib/agent/... so external callers via the bin/lib entrypoints still work.
  • All in-tree callers (src/lib/onboard.ts, src/lib/policies.ts, src/lib/sandbox-config.ts, src/lib/state/sandbox.ts, src/lib/actions/onboard.ts, src/lib/actions/sandbox/rebuild.ts, src/lib/commands/debug.ts, src/lib/verify-deployment.ts, src/lib/sandbox-version.ts, multiple test files) updated symmetrically.
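
The .coderabbit.yaml glob change matters because a single `*` normally does not cross a directory separator, so the old pattern stops matching once the files live in a subfolder. A rough sketch of that behavior, assuming common glob semantics (`*` within one path segment, `**` across segments); `globToRegExp` is a hypothetical helper written for this illustration and is not CodeRabbit's matcher:

```typescript
// Convert a simple glob into a RegExp.
// Assumed semantics: "*" matches within one path segment, "**" matches across segments.
function globToRegExp(glob: string): RegExp {
  let re = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*") {
      if (glob.charAt(i + 1) === "*") {
        re += ".*"; // "**": any characters, including "/"
        i++;
      } else {
        re += "[^/]*"; // "*": any characters except "/"
      }
    } else {
      // Escape regex metacharacters so they match literally.
      re += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp(`^${re}$`);
}

const oldGlob = globToRegExp("src/lib/shields*.ts");
const newGlob = globToRegExp("src/lib/shields/**");

console.log(oldGlob.test("src/lib/shields.ts"));       // true
console.log(oldGlob.test("src/lib/shields/index.ts")); // false
console.log(newGlob.test("src/lib/shields/index.ts")); // true
console.log(newGlob.test("src/lib/shields/timer.ts")); // true
```

Under these assumptions the old pattern covers only the flat file layout, so leaving it in place would have silently dropped the path instructions for the moved shields files.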

Tests moved alongside their source (test/shields.test.ts → src/lib/shields/index.test.ts, test/shields-audit.test.ts → src/lib/shields/audit.test.ts), consistent with the colocation pattern in the placement map.

CI: pr.yaml rollup checks pass (commit-lint, dco, layer-boundary, check-hash, changes, get-pr-info, block edits to migrated legacy paths and removed shims). 1 prior pr-self-hosted run success on pull-request/3191. CodeRabbit / checks / macos-e2e / wsl-e2e / build-sandbox-images and current self-hosted run still in progress at approval time.
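
The "block edits to migrated legacy paths and removed shims" check can be pictured as a lookup from old paths to their new homes. The sketch below is only an illustration under assumptions: the table entries are moves named in this review, while `LEGACY_MOVES` and `findLegacyEdits` are hypothetical names, and the real shape of `REMOVED_SHIM_MOVES` in scripts/check-legacy-migrated-paths.ts may differ.

```typescript
// Hypothetical table of legacy paths and where their content now lives.
const LEGACY_MOVES: ReadonlyMap<string, string> = new Map([
  ["bin/lib/debug.js", "src/lib/diagnostics/debug.ts"],
  ["test/shields.test.ts", "src/lib/shields/index.test.ts"],
  ["test/shields-audit.test.ts", "src/lib/shields/audit.test.ts"],
]);

// Return one message per changed file that still targets a legacy path.
function findLegacyEdits(changedFiles: string[]): string[] {
  const problems: string[] = [];
  for (const file of changedFiles) {
    const target = LEGACY_MOVES.get(file);
    if (target !== undefined) {
      problems.push(`${file} has moved; edit ${target} instead`);
    }
  }
  return problems;
}

const problems = findLegacyEdits(["src/lib/policies.ts", "test/shields.test.ts"]);
console.log(problems);
// [ "test/shields.test.ts has moved; edit src/lib/shields/index.test.ts instead" ]
```

A check of this kind would fail the PR when the returned list is non-empty, which keeps new edits from landing on paths that no longer hold the source of truth.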

@cv cv merged commit 794beaf into main May 7, 2026
20 of 21 checks passed
jyaunches pushed a commit that referenced this pull request May 8, 2026
## Summary
- Bump the docs release metadata to `0.0.37`.
- Document release-prep updates for messaging policy presets, sandbox
runtime utilities, and the GPU CDI troubleshooting path.
- Refresh generated `nemoclaw-user-*` skills from the updated docs.

## Source summary
- #3159 -> `docs/reference/troubleshooting.md`: Documents the GPU CDI
preflight warning and remediation for `nvidia.com/gpu=all` gateway start
failures.
- #2415 -> `docs/reference/network-policies.md`,
`docs/manage-sandboxes/messaging-channels.md`,
`docs/network-policy/customize-network-policy.md`: Clarifies that
Telegram, Discord, and Slack egress comes from opt-in messaging presets,
not the baseline policy.
- #3091 -> `docs/deployment/sandbox-hardening.md`,
`docs/network-policy/customize-network-policy.md`: Documents the
retained sandbox utilities `vi`, `jq`, and `dos2unix` while keeping
host-side policy files as the durable source of truth.

## Test plan
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user`
- `make docs`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hooks: markdownlint, docs-to-skills verification,
gitleaks, commitlint, CLI typecheck

## Skipped
- #3193 and #3191 matched `docs/.docs-skip` entries for experimental
shields/config paths.
- #3200 and #3183 were test-only fixes.
- #3189 and #3163 were internal documentation/refactor changes with no
public docs impact.

Made with [Cursor](https://cursor.com)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
  * Clarified which utilities remain in the sandbox runtime for lightweight inspection and cleanup
  * Noted that messaging endpoints (Discord, Slack, Telegram) are not in the baseline policy and that channel presets are applied during onboarding
  * Added GPU passthrough troubleshooting for gateway startup
  * Updated release/version bump and release-prep workflow guidance, including Discord preset description updates
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
cv added a commit that referenced this pull request May 15, 2026
…#3409)

## Summary

- Wraps the terminal `wait "$GATEWAY_PID"` in
`scripts/nemoclaw-start.sh` (both non-root and root/step-down branches)
in a respawn loop so unexpected gateway death no longer drops PID 1 and
reaps the sandbox container.
- Adds a 60s-window respawn-count guard: after 5 respawns in <60s, logs
a `CRITICAL` line so a crashing gateway surfaces in `/tmp/gateway.log`
rather than being masked.
- Preserves existing `cleanup_on_signal` shutdown semantics — clean exit
(rc=0) still drops PID 1, SIGTERM/SIGINT still trigger the existing
handler.
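
The respawn-count guard described above is implemented in shell in `scripts/nemoclaw-start.sh`; the TypeScript below is only a language-neutral sketch of the sliding-window idea. `RespawnGuard` and `recordRespawn` are hypothetical names, and the exact comparison (five or more respawns within the last 60 seconds) is an assumption about how "5 respawns in <60s" is counted.

```typescript
// Tracks respawn timestamps and reports when too many fall inside one window.
class RespawnGuard {
  private readonly limit: number;
  private readonly windowMs: number;
  private stamps: number[] = [];

  constructor(limit = 5, windowMs = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Record a respawn at nowMs; return true when the CRITICAL threshold is reached.
  recordRespawn(nowMs: number): boolean {
    this.stamps.push(nowMs);
    // Keep only respawns that happened within the window ending at nowMs.
    this.stamps = this.stamps.filter((t) => nowMs - t < this.windowMs);
    return this.stamps.length >= this.limit;
  }
}

// A crash loop: five respawns one second apart trips the guard on the fifth.
const crashLoop = new RespawnGuard();
console.log([0, 1000, 2000, 3000, 4000].map((t) => crashLoop.recordRespawn(t)));
// [ false, false, false, false, true ]

// Occasional restarts 20 seconds apart never accumulate five inside one window.
const occasional = new RespawnGuard();
console.log([0, 20000, 40000, 60000, 80000].map((t) => occasional.recordRespawn(t)));
// [ false, false, false, false, false ]
```

In this sketch the guard only signals; whether to keep respawning after the signal is left to the caller, which matches the summary's description of logging a `CRITICAL` line rather than masking the crash.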

Closes #2757.

## Root cause

The bug report blamed `src/lib/agent-runtime.ts` for missing supervisor
logic, but that file was moved to `src/lib/agent/runtime.ts` (#3191) and
the gateway launch is correct: `nohup ... &` followed by
`wait "$GATEWAY_PID"`. The real cause sits one layer down: `wait` unblocks
the moment the gateway dies, PID 1 exits, and Docker reaps the container
by design (`scripts/nemoclaw-start.sh` is the entrypoint). NemoClaw also
doesn't pass `--restart=` when OpenShell creates the sandbox, so neither
layer recovers.

## Verification

Reproduced locally in Ubuntu 24.04 via a synthetic entrypoint mirroring
lines 2240-2268 of this file:

| Test | Result |
|---|---|
| Without patch, `kill -9 $GATEWAY_PID` | Container `exited` (exitCode=137, restartCount=0). Matches QA report. |
| With patch + same kill | Loop sees rc=137, sleeps 2s, relaunches. Container stays `running`, gateway gets new PID. `nemoclaw status` → healthy. |

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes (shellcheck clean on the touched file)
- [ ] `npm test` passes (no JS/TS touched; not run)
- [ ] Tests added or updated for new or changed behavior (see below)
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes (n/a: internal entrypoint behavior)
- [ ] `make docs` builds without warnings (doc changes only)

## Test plan

Manual repro mirrors the QA acceptance criteria:

1. `nemoclaw onboard --name my-assistant --non-interactive`
2. `docker exec <sandbox-container> pgrep -af "openclaw gateway"` → note PID
3. `docker exec <sandbox-container> kill -9 <pid>`
4. Wait 5s
5. `nemoclaw my-assistant status` → expect HEALTHY (no `connect` needed)

I did **not** add an automated E2E test for the kill-and-respawn flow in
this PR (scope kept minimal per #2757's acceptance criteria). Happy to
follow up with one if reviewers want; it would slot into
`test/e2e/test-sandbox-survival.sh`.

## Notes for reviewers

- Both branches of the entrypoint (non-root at L2021, root/step-down at
L2240) get the same loop. The root branch uses
`"${STEP_DOWN_PREFIX_GATEWAY[@]}"` to preserve the gateway-user UID
separation on respawn.
- `SANDBOX_WAIT_PID` is reassigned on each respawn so
`cleanup_on_signal` (in `scripts/lib/sandbox-init.sh`) waits on the
live PID during shutdown.
- `SANDBOX_CHILD_PIDS` accumulates respawn PIDs; the trap kills them
all with `2>/dev/null || true` so stale entries don't break shutdown.
- Tier-3 follow-up (have `nemoclaw status` also call
`checkAndRecoverSandboxProcesses`, currently only `connect` does) is
logged as a separate quick-win and is not in this PR's scope.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
  * Gateway service now auto-restarts if it exits unexpectedly, improving availability and reducing manual intervention.
  * Added safeguards and enhanced logging to detect and emit a critical alert when frequent restart attempts occur within a short window, preventing runaway restart loops.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Julie Yaunches <jyaunches@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Labels

v0.0.37 Release target

5 participants