fix(ci): revert #2186 and use direct symlink for nemoclaw CLI on Brev #2196

Merged
cv merged 7 commits into main from fix/brev-cpu-npm-link-hang
Apr 21, 2026

@cjagwani
Contributor

@cjagwani cjagwani commented Apr 21, 2026

Summary

Reverts #2186 (which made Brev E2E failures 10× slower) and replaces `sudo npm link` with a direct `sudo ln -sf` symlink in both the launchable setup and the in-test bootstrap path. `npm link` is overkill for what we actually need and hangs indefinitely on cold CPU Brev instances. Direct symlink is O(1) and deterministic.

Related Issue

Regression from my own #2186. Surfaced while validating #2183 (ollama E2E).

Changes

This PR contains two commits:

  1. Revert #2186 (fix(ci): move sudo npm link into launchable to unblock non-full Brev E2E). Restores the launchable setup to its pre-`sudo npm link` state. On its own this would leave non-full suites broken by the pre-existing issue from #1888 (chore(ci): extract helpers from brev-e2e beforeAll and fix worktree installer detection), but without burning a 20-min Brev instance per run.
  2. Replace `sudo npm link` with direct symlink.
    • `scripts/brev-launchable-ci-cpu.sh` — after plugin build, create `/usr/local/bin/nemoclaw → $NEMOCLAW_CLONE_DIR/bin/nemoclaw.js` via `sudo ln -sf`. Drop `sudo chown -R` (only the single symlink is root-owned now, not node_modules).
    • `test/e2e/brev-e2e.test.ts` — replace the in-test `sudo npm link` with the same direct-symlink approach. Idempotent re-link so local dev runs that skip the launchable still work.
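
The direct-symlink step described in the list above could look roughly like the following. This is a minimal sketch, not the script's actual contents: the helper name `link_cli` and its optional command-prefix argument are hypothetical, introduced so the same function can run with or without `sudo`.

```shell
# link_cli TARGET LINK [PREFIX...]
# Hypothetical helper: point LINK at TARGET and make TARGET executable.
# PREFIX is an optional command prefix such as `sudo`.
link_cli() {
  local target="$1" link="$2"
  shift 2
  # -s: symbolic link; -f: replace an existing link, so re-running is safe.
  "$@" ln -sf "$target" "$link"
  "$@" chmod +x "$target"
}
```

On the Brev instance this would be invoked as `link_cli "$NEMOCLAW_CLONE_DIR/bin/nemoclaw.js" /usr/local/bin/nemoclaw sudo`. Because `ln -sf` replaces an existing link, the launchable and the in-test bootstrap can both run it without conflict.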

Evidence

Type of Change

  • Code change (bug fix)

Verification

  • `bash -n` on the launchable script — syntax OK
  • `npm run typecheck:cli` — passes
  • Pre-push hooks — passes (modulo the pre-existing flaky `test/install-preflight.test.ts:107`, resolves on retry, unrelated)
  • End-to-end Brev E2E run — to be validated via `gh workflow run e2e-brev.yaml --ref fix/brev-cpu-npm-link-hang --field test_suite=credential-sanitization` (any non-`full` suite proves the bootstrap now completes). Will attach run URL before merge.

Retro note

Shipped #2186 without running the Brev suite end-to-end — exact failure mode I had flagged on #2123 earlier the same day. This PR is validated the right way before merge.

AI Disclosure

  • AI-assisted — tool: Claude Code

Summary by CodeRabbit

  • Chores

    • Made CLI installation deterministic and more reliable with explicit build and idempotent system-wide linking so the tool is consistently available and executable.
  • Tests

    • Hardened remote test setup and reduced a CLI-linking timeout for faster feedback.
    • On workflow failures, automatically collect VM diagnostics and upload them as a debug artifact for easier troubleshooting.

cjagwani and others added 2 commits April 21, 2026 11:11
PR #2186 tried to unblock non-full Brev E2E suites by moving the slow
`sudo npm link` step from the test (120s cap) into the launchable
setup (20min outer cap). That hit a worse pathology: on cold CPU Brev,
`sudo npm link` + `sudo chown -R` on a ~50k-file node_modules tree
doesn't complete within 20 minutes at all. Every run now hangs at
"Linking nemoclaw CLI globally..." until the outer cap trips — 10×
longer to fail and 10× more Brev credit per failed run.

The previous commit reverts #2186. This commit replaces `sudo npm
link` with what npm link actually produces: two symlinks. We can do
them directly with `sudo ln -sf` and skip npm's global-prefix
housekeeping entirely. O(1), no chown traversal, no hang.

Changes:
  - scripts/brev-launchable-ci-cpu.sh: after plugin build, create
    /usr/local/bin/nemoclaw → $NEMOCLAW_CLONE_DIR/bin/nemoclaw.js with
    a direct `sudo ln -sf`. Drop the `sudo chown -R node_modules` that
    used to pair with npm link (no longer needed — only one file is
    owned by root now).
  - test/e2e/brev-e2e.test.ts: replace the in-test `sudo npm link` with
    the same direct-symlink approach. Launchable already pre-links on
    the same path; this is idempotent re-link so local dev runs that
    skip the launchable still work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 42bc1775-0ef2-4f6e-8e9e-c7c1fc9053d6

📥 Commits

Reviewing files that changed from the base of the PR and between 3bb3812 and d2a0cfe.

📒 Files selected for processing (1)
  • .github/workflows/e2e-brev.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/e2e-brev.yaml

📝 Walkthrough

Walkthrough

CI script, e2e test, and workflow changes: the CI/test now build the NemoClaw CLI (npm run build:cli) after installing deps, replace sudo npm link + recursive chown with an idempotent symlink at /usr/local/bin/nemoclaw and chmod +x, reduce one SSH timeout, and add conditional Brev VM debug collection on workflow failure.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CI / Launchable script: `scripts/brev-launchable-ci-cpu.sh` | Adds `npm run build:cli` after `npm install --ignore-scripts`; replaces `sudo npm link` + recursive chown with deterministic `ln -sf` symlink to `/usr/local/bin/nemoclaw` and `chmod +x`; updates log messages. |
| E2E test bootstrap: `test/e2e/brev-e2e.test.ts` | Runs `npm run build:cli` on remote before linking, replaces `sudo npm link` flow with force-symlink to `/usr/local/bin/nemoclaw` and `sudo chmod +x`; reduces SSH timeout for linking from 120000ms to 30000ms; updates logs. |
| Workflow: failure diagnostics: `.github/workflows/e2e-brev.yaml` | On job failure, performs best-effort Brev VM refresh + SSH to collect `/tmp/nemoclaw-onboard.log`, `openshell sandbox list`, `openshell gateway status`, `docker ps -a`; bundles `/tmp/nc-debug.tar.gz`, copies it back into `brev-debug-bundle/`, and uploads that artifact. Steps run only when `failure()` is true. |

Sequence Diagram(s)

sequenceDiagram
    participant CI as CI Runner
    participant GitHub as GitHub Actions
    participant BrevVM as Brev VM (remote)
    participant Artifact as Actions Artifact

    CI->>BrevVM: rsync repo (excludes dist)
    CI->>BrevVM: npm install --ignore-scripts
    CI->>BrevVM: npm run build:cli
    CI->>BrevVM: ln -sf <repo>/bin/nemoclaw.js /usr/local/bin/nemoclaw
    CI->>BrevVM: sudo chmod +x /usr/local/bin/nemoclaw
    GitHub->>CI: job fails (conditional)
    CI->>BrevVM: refresh Brev auth + ssh, collect logs (/tmp/nemoclaw-onboard.log, openshell, docker ps)
    BrevVM->>CI: scp /tmp/nc-debug.tar.gz -> brev-debug-bundle/
    CI->>Artifact: upload brev-debug-bundle (artifact)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I built the CLI and tied a neat knot,
A symlink shines where chowns are not.
If runs go sideways I hop in to pry,
I fetch all the logs and wave them goodbye. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: reverting a prior approach (#2186) and replacing npm link with direct symlink for nemoclaw CLI on Brev instances. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Brev E2E (credential-sanitization): FAILED on branch fix/brev-cpu-npm-link-hang. See logs

@github-actions

Brev E2E (credential-sanitization): FAILED on branch fix/brev-cpu-npm-link-hang. See logs

Instance e2e-pr-2196 is still running. To SSH in:

brev refresh && ssh e2e-pr-2196

When done, delete it: brev delete e2e-pr-2196

@github-actions

Brev E2E (credential-sanitization): FAILED on branch fix/brev-cpu-npm-link-hang. See logs

Instance e2e-pr-2196 is still running. To SSH in:

brev refresh && ssh e2e-pr-2196

When done, delete it: brev delete e2e-pr-2196

@github-actions

Brev E2E (credential-sanitization): FAILED on branch fix/brev-cpu-npm-link-hang. See logs

Instance e2e-pr-2196 is still running. To SSH in:

brev refresh && ssh e2e-pr-2196

When done, delete it: brev delete e2e-pr-2196

SSH'd into the keep_alive Brev instance from the last failed run and
found the onboard log consisted entirely of:

  Error: Cannot find module '../dist/nemoclaw'
  ...
  code: 'MODULE_NOT_FOUND'
  Node.js v22.22.2

Root cause: the flow runs `npm install --ignore-scripts`, which skips
the `prepare` lifecycle that normally invokes `build:cli`. Before
PR #2186, `sudo npm link` implicitly triggered `prepare` via npm's
lifecycle machinery and built `dist/` as a side effect. The direct
`ln -sf` symlink this PR introduces does not — so `dist/` is never
populated, `bin/nemoclaw.js`'s `require("../dist/nemoclaw")` crashes,
onboard dies instantly, and the test polls a dead process for 20 min.

Changes:
  - scripts/brev-launchable-ci-cpu.sh: run `npm run build:cli`
    explicitly after the `--ignore-scripts` root install.
  - test/e2e/brev-e2e.test.ts: same — after rsync'ing PR branch src
    (which excludes dist/), run `npm run build:cli` before invoking
    the CLI.
  - .github/workflows/e2e-brev.yaml: upload a debug bundle on failure
    (/tmp/nemoclaw-onboard.log, openshell sandbox list, docker ps,
    gateway status). Future failures will leave breadcrumbs without
    needing keep_alive + manual SSH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
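
The failure mode described in the commit message above (linking a CLI whose `dist/` was never built) can be guarded against with an explicit build-then-verify step. The sketch below is illustrative only: `has_cli_build` and `build_and_verify_cli` are hypothetical helpers, and the two output paths they check are assumptions about what `require("../dist/nemoclaw")` resolves to, not something the PR states.

```shell
# has_cli_build REPO_DIR
# Succeeds if a file that require("../dist/nemoclaw") could plausibly resolve
# to exists. The two candidate paths are assumptions for illustration.
has_cli_build() {
  [ -f "$1/dist/nemoclaw.js" ] || [ -f "$1/dist/nemoclaw/index.js" ]
}

# build_and_verify_cli REPO_DIR BUILD_CMD...
# Runs BUILD_CMD inside REPO_DIR, then fails loudly if dist/ is still missing
# instead of letting the CLI crash later with MODULE_NOT_FOUND.
build_and_verify_cli() {
  local repo="$1"
  shift
  (cd "$repo" && "$@") || return 1
  if ! has_cli_build "$repo"; then
    echo "build finished but dist/nemoclaw is missing in $repo" >&2
    return 1
  fi
}
```

A call such as `build_and_verify_cli "$NEMOCLAW_CLONE_DIR" npm run build:cli` before the symlink step would turn a missing `dist/` into an immediate error rather than a 20-minute poll on a dead process.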
@github-actions

Brev E2E (credential-sanitization): FAILED on branch fix/brev-cpu-npm-link-hang. See logs

Instance e2e-pr-2196 is still running. To SSH in:

brev refresh && ssh e2e-pr-2196

When done, delete it: brev delete e2e-pr-2196

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/e2e/brev-e2e.test.ts (1)

450-450: Quote the symlink target path for shell safety

The current command is fine for standard $HOME, but quoting the path makes this robust against whitespace/special chars in remote directory paths.

Small robustness tweak
-    `sudo ln -sf ${resolvedRemoteDir}/bin/nemoclaw.js /usr/local/bin/nemoclaw && sudo chmod +x ${resolvedRemoteDir}/bin/nemoclaw.js`,
+    `sudo ln -sf "${resolvedRemoteDir}/bin/nemoclaw.js" /usr/local/bin/nemoclaw && sudo chmod +x "${resolvedRemoteDir}/bin/nemoclaw.js"`,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/brev-e2e.test.ts` at line 450, The shell command constructing the
symlink uses ${resolvedRemoteDir}/bin/nemoclaw.js unquoted which can break if
resolvedRemoteDir contains spaces or special chars; update the string (the
command literal that currently reads `sudo ln -sf
${resolvedRemoteDir}/bin/nemoclaw.js /usr/local/bin/nemoclaw && sudo chmod +x
${resolvedRemoteDir}/bin/nemoclaw.js`) to quote the target path where
resolvedRemoteDir is used (e.g., wrap ${resolvedRemoteDir}/bin/nemoclaw.js in
quotes) so both the ln -sf and chmod parts safely handle paths with spaces or
special characters.
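
To see why the quoting suggested above matters, the following sketch counts how many arguments the shell produces from a path containing a space. The path and the `count_args` helper are made up for illustration; the review itself notes the current command is fine for a standard `$HOME`.

```shell
# count_args prints how many arguments it received.
count_args() { echo "$#"; }

dir="/home/ubuntu/my repo"   # hypothetical path containing a space

# Unquoted: the shell splits the path at the space, so a command like ln
# would receive an extra argument and link the wrong thing (or fail).
unquoted=$(count_args $dir/bin/nemoclaw.js /usr/local/bin/nemoclaw)

# Quoted: the path stays a single argument.
quoted=$(count_args "$dir/bin/nemoclaw.js" /usr/local/bin/nemoclaw)

echo "unquoted=$unquoted quoted=$quoted"   # prints "unquoted=3 quoted=2"
```

The same splitting happens in the remote shell that receives the interpolated command string, which is why the quotes belong inside the template literal.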
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/e2e-brev.yaml:
- Around line 233-246: Wrap each in-VM diagnostic command inside the ssh
multi-line string with a per-command timeout (using the timeout utility) so they
cannot hang; update the block that builds INSTANCE and runs ssh to prefix cp,
openshell sandbox list, openshell gateway status, docker ps -a and tar -C /tmp
-czf with e.g. "timeout 10s" or "timeout 30s" as appropriate (and fall back to
"|| true" to preserve existing behavior), ensuring the ssh command still uses
ConnectTimeout=10 but now also bounds each remote step so commands like
openshell, docker ps and tar cannot stall the job indefinitely.

---

Nitpick comments:
In `@test/e2e/brev-e2e.test.ts`:
- Line 450: The shell command constructing the symlink uses
${resolvedRemoteDir}/bin/nemoclaw.js unquoted which can break if
resolvedRemoteDir contains spaces or special chars; update the string (the
command literal that currently reads `sudo ln -sf
${resolvedRemoteDir}/bin/nemoclaw.js /usr/local/bin/nemoclaw && sudo chmod +x
${resolvedRemoteDir}/bin/nemoclaw.js`) to quote the target path where
resolvedRemoteDir is used (e.g., wrap ${resolvedRemoteDir}/bin/nemoclaw.js in
quotes) so both the ln -sf and chmod parts safely handle paths with spaces or
special characters.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9b4e79cb-8e83-44ea-b7b3-a5f29c563394

📥 Commits

Reviewing files that changed from the base of the PR and between 2eb5274 and 3bb3812.

📒 Files selected for processing (3)
  • .github/workflows/e2e-brev.yaml
  • scripts/brev-launchable-ci-cpu.sh
  • test/e2e/brev-e2e.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/brev-launchable-ci-cpu.sh

Comment thread .github/workflows/e2e-brev.yaml
@github-actions

Brev E2E (credential-sanitization): PASSED on branch fix/brev-cpu-npm-link-hang. See logs

Instance e2e-pr-2196 is still running. To SSH in:

brev refresh && ssh e2e-pr-2196

When done, delete it: brev delete e2e-pr-2196

@cjagwani
Contributor Author

Validated end-to-end on Brev. Run 24746798821 (test_suite=credential-sanitization) passed on a fresh CPU Brev instance with this branch. Bootstrap (including the direct symlink and explicit build:cli) completes cleanly, onboard succeeds, and the credential-sanitization suite passes.

Ready for review.

@cjagwani cjagwani self-assigned this Apr 21, 2026
@cjagwani cjagwani requested a review from cv April 21, 2026 21:38
@cjagwani cjagwani added the security, CI/CD, priority: high, and E2E labels Apr 21, 2026
@cjagwani cjagwani requested a review from jyaunches April 21, 2026 21:40
cjagwani and others added 2 commits April 21, 2026 14:46
CodeRabbit review on #2196 flagged that the debug-bundle step could
hang inside the SSH session if openshell/docker are in a bad state —
precisely the case we need to survive since the step exists to
diagnose pathological VM states.

Wrap each service-touching command with `timeout`:
  - openshell sandbox list   → 15s
  - openshell gateway status → 15s
  - docker ps -a             → 15s
  - tar -czf                 → 30s

Kept `cp` unwrapped (local-fs copy can't hang on a service).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
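
The per-command bounds listed in the commit message above can be sketched as a small wrapper. This is an assumption-laden illustration rather than the workflow's actual step: `collect` is a hypothetical helper, the output file names are invented, and it relies on the `timeout` utility from GNU coreutils being present on the VM.

```shell
# collect LIMIT OUTFILE CMD...
# Hypothetical helper: run one diagnostic command for at most LIMIT seconds,
# capture stdout and stderr in OUTFILE, and never fail the surrounding step.
collect() {
  local limit="$1" out="$2"
  shift 2
  timeout "$limit" "$@" > "$out" 2>&1 || true
}

# Intended use on the VM (not executed here; file names are invented):
#   collect 15 /tmp/nc-debug/sandboxes.txt openshell sandbox list
#   collect 15 /tmp/nc-debug/gateway.txt   openshell gateway status
#   collect 15 /tmp/nc-debug/docker-ps.txt docker ps -a
```

Routing every service-touching command through one wrapper keeps the bound and the `|| true` fallback together, so a hung `openshell` or `docker` call costs at most its own limit.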
@cv cv merged commit a0a9139 into main Apr 21, 2026
9 checks passed
@cv cv added the v0.0.22 Release target label Apr 21, 2026

Labels

CI/CD Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions. E2E End-to-end testing — Brev infrastructure, test cases, nightly failures, and coverage gaps priority: high Important issue that should be resolved in the next release security Something isn't secure v0.0.22 Release target


3 participants