fix(ci): move sudo npm link into launchable to unblock non-full Brev E2E#2186
fix(ci): move sudo npm link into launchable to unblock non-full Brev E2E#2186
Conversation
Every Brev E2E suite other than 'full' has been failing on main since 2026-04-16 (PR #1888) with: spawnSync /bin/sh ETIMEDOUT ... sudo npm link ... SIGTERM Root cause: #1888 added a `sudo npm link` step inside the test's bootstrapLaunchable() with a 120s execSync cap. On a cold CPU Brev instance, `npm link` on a fresh global prefix consistently takes 2-3 min while npm resolves and symlinks production deps for the first time. The `full` suite skips this step (uses install.sh instead), so that path was fine; every other suite (credential-sanitization, telegram-injection, messaging-providers, and any newly added suites) hits the timeout. Fix: do the cold `sudo npm link` during the launchable's setup phase, right after the plugin build. The launchable's readiness window has no tight per-step execSync timeout, so the slow first link happens there. When the test suite later rsyncs PR branch code over this clone and re-runs `sudo npm link`, npm sees the existing global symlink and finishes in seconds — well under the existing 120s cap. Also updates the code comment in the test so future readers know the cold-link cost has been relocated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe CI startup script adds a setup step that creates a global Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
SSH'd into the keep_alive Brev instance from the last failed run and found the onboard log consisted entirely of: Error: Cannot find module '../dist/nemoclaw' ... code: 'MODULE_NOT_FOUND' Node.js v22.22.2 Root cause: the flow runs `npm install --ignore-scripts`, which skips the `prepare` lifecycle that normally invokes `build:cli`. Before PR #2186, `sudo npm link` implicitly triggered `prepare` via npm's lifecycle machinery and built `dist/` as a side effect. The direct `ln -sf` symlink this PR introduces does not — so `dist/` is never populated, `bin/nemoclaw.js`'s `require("../dist/nemoclaw")` crashes, onboard dies instantly, and the test polls a dead process for 20 min. Changes: - scripts/brev-launchable-ci-cpu.sh: run `npm run build:cli` explicitly after the `--ignore-scripts` root install. - test/e2e/brev-e2e.test.ts: same — after rsync'ing PR branch src (which excludes dist/), run `npm run build:cli` before invoking the CLI. - .github/workflows/e2e-brev.yaml: upload a debug bundle on failure (/tmp/nemoclaw-onboard.log, openshell sandbox list, docker ps, gateway status). Future failures will leave breadcrumbs without needing keep_alive + manual SSH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
…#2196) ## Summary Reverts #2186 (which made Brev E2E failures 10× slower) and replaces \`sudo npm link\` with a direct \`sudo ln -sf\` symlink in both the launchable setup and the in-test bootstrap path. \`npm link\` is overkill for what we actually need and hangs indefinitely on cold CPU Brev instances. Direct symlink is O(1) and deterministic. ## Related Issue Regression from my own #2186. Surfaced while validating #2183 (ollama E2E). ## Changes This PR contains two commits: 1. **Revert #2186.** Restores the launchable setup to its pre-\`sudo npm link\` state. This alone would leave non-full suites broken by the original pre-existing #1888 issue, but without burning a 20-min Brev instance per run. 2. **Replace \`sudo npm link\` with direct symlink.** - \`scripts/brev-launchable-ci-cpu.sh\` — after plugin build, create \`/usr/local/bin/nemoclaw → \$NEMOCLAW_CLONE_DIR/bin/nemoclaw.js\` via \`sudo ln -sf\`. Drop \`sudo chown -R\` (only the single symlink is root-owned now, not node_modules). - \`test/e2e/brev-e2e.test.ts\` — replace the in-test \`sudo npm link\` with the same direct-symlink approach. Idempotent re-link so local dev runs that skip the launchable still work. ## Evidence - **Before #2186 (2026-04-14):** non-full suites failed at ~2 min with a clear \`sudo npm link ETIMEDOUT\` (pre-existing #1888 regression, but cheap to fail). - **After #2186 (today):** non-full suites hang on \`sudo npm link\` inside the launchable until the 20-min outer cap trips. 10× longer to fail, 10× more Brev credit per failed run. Evidence: [run 24737205166](https://github.com/NVIDIA/NemoClaw/actions/runs/24737205166) — stuck on \`Linking nemoclaw CLI globally...\` from 808s to 1198s with no progress. - \`npm link\` does two things: \`/usr/local/lib/node_modules/<name>\` symlink + \`/usr/local/bin/<bin>\` symlink. We only need the second one. Replacing with \`sudo ln -sf\` bypasses npm's global-prefix housekeeping and the \`sudo chown -R node_modules\` traversal that likely drove the hang. ## Type of Change - [x] Code change (bug fix) ## Verification - [x] \`bash -n\` on the launchable script — syntax OK - [x] \`npm run typecheck:cli\` — passes - [x] Pre-push hooks — passes (modulo the pre-existing flaky \`test/install-preflight.test.ts:107\`, resolves on retry, unrelated) - [ ] **End-to-end Brev E2E run** — to be validated via \`gh workflow run e2e-brev.yaml --ref fix/brev-cpu-npm-link-hang --field test_suite=credential-sanitization\` (any non-\`full\` suite proves the bootstrap now completes). Will attach run URL before merge. ## Retro note Shipped #2186 without running the Brev suite end-to-end — exact failure mode I had flagged on #2123 earlier the same day. This PR is validated the right way before merge. ## AI Disclosure - [x] AI-assisted — tool: Claude Code <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Made CLI installation deterministic and more reliable with explicit build and idempotent system-wide linking so the tool is consistently available and executable. * **Tests** * Hardened remote test setup and reduced a CLI-linking timeout for faster feedback. * On workflow failures, automatically collect VM diagnostics and upload them as a debug artifact for easier troubleshooting. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Every Brev E2E suite other than `full` has been failing on `main` since 2026-04-16 with `spawnSync /bin/sh ETIMEDOUT ... sudo npm link ... SIGTERM`. Root cause is PR #1888, which added a `sudo npm link` step inside the test with a 120s `execSync` cap. On a cold CPU Brev instance, `npm link` on a fresh global prefix consistently takes 2-3 min — past the cap.
Related Issue
Regression from #1888. No dedicated issue filed — surfaced while working on #1924/PR #2183.
Changes
Evidence
Type of Change
Verification
Next step
Trigger the Brev E2E with any non-`full` suite (e.g. `test_suite=credential-sanitization`) to confirm the bootstrap completes. That's the one signal we need.
AI Disclosure
Summary by CodeRabbit