fix(ci): revert #2186 and use direct symlink for nemoclaw CLI on Brev#2196
fix(ci): revert #2186 and use direct symlink for nemoclaw CLI on Brev#2196
Conversation
PR #2186 tried to unblock non-full Brev E2E suites by moving the slow `sudo npm link` step from the test (120s cap) into the launchable setup (20min outer cap). That hit a worse pathology: on cold CPU Brev, `sudo npm link` + `sudo chown -R` on a ~50k-file node_modules tree doesn't complete within 20 minutes at all. Every run now hangs at "Linking nemoclaw CLI globally..." until the outer cap trips — 10× longer to fail and 10× more Brev credit per failed run. The previous commit reverts #2186. This commit replaces `sudo npm link` with what npm link actually produces: two symlinks. We can do them directly with `sudo ln -sf` and skip npm's global-prefix housekeeping entirely. O(1), no chown traversal, no hang. Changes: - scripts/brev-launchable-ci-cpu.sh: after plugin build, create /usr/local/bin/nemoclaw → $NEMOCLAW_CLONE_DIR/bin/nemoclaw.js with a direct `sudo ln -sf`. Drop the `sudo chown -R node_modules` that used to pair with npm link (no longer needed — only one file is owned by root now). - test/e2e/brev-e2e.test.ts: replace the in-test `sudo npm link` with the same direct-symlink approach. Launchable already pre-links on the same path; this is idempotent re-link so local dev runs that skip the launchable still work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughCI script, e2e test, and workflow changes: the CI/test now build the NemoClaw CLI ( Changes
Sequence Diagram(s)sequenceDiagram
participant CI as CI Runner
participant GitHub as GitHub Actions
participant BrevVM as Brev VM (remote)
participant Artifact as Actions Artifact
CI->>BrevVM: rsync repo (excludes dist)
CI->>BrevVM: npm install --ignore-scripts
CI->>BrevVM: npm run build:cli
CI->>BrevVM: ln -sf <repo>/bin/nemoclaw.js /usr/local/bin/nemoclaw
CI->>BrevVM: sudo chmod +x /usr/local/bin/nemoclaw
GitHub->>CI: job fails (conditional)
CI->>BrevVM: refresh brevet auth + ssh, collect logs (/tmp/nemoclaw-onboard.log, openshell, docker ps)
BrevVM->>CI: scp /tmp/nc-debug.tar.gz -> brev-debug-bundle/
CI->>Artifact: upload brev-debug-bundle (artifact)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
❌ Brev E2E (credential-sanitization): FAILED on branch |
|
❌ Brev E2E (credential-sanitization): FAILED on branch
|
|
❌ Brev E2E (credential-sanitization): FAILED on branch
|
|
❌ Brev E2E (credential-sanitization): FAILED on branch
|
SSH'd into the keep_alive Brev instance from the last failed run and found the onboard log consisted entirely of: Error: Cannot find module '../dist/nemoclaw' ... code: 'MODULE_NOT_FOUND' Node.js v22.22.2 Root cause: the flow runs `npm install --ignore-scripts`, which skips the `prepare` lifecycle that normally invokes `build:cli`. Before PR #2186, `sudo npm link` implicitly triggered `prepare` via npm's lifecycle machinery and built `dist/` as a side effect. The direct `ln -sf` symlink this PR introduces does not — so `dist/` is never populated, `bin/nemoclaw.js`'s `require("../dist/nemoclaw")` crashes, onboard dies instantly, and the test polls a dead process for 20 min. Changes: - scripts/brev-launchable-ci-cpu.sh: run `npm run build:cli` explicitly after the `--ignore-scripts` root install. - test/e2e/brev-e2e.test.ts: same — after rsync'ing PR branch src (which excludes dist/), run `npm run build:cli` before invoking the CLI. - .github/workflows/e2e-brev.yaml: upload a debug bundle on failure (/tmp/nemoclaw-onboard.log, openshell sandbox list, docker ps, gateway status). Future failures will leave breadcrumbs without needing keep_alive + manual SSH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
|
❌ Brev E2E (credential-sanitization): FAILED on branch
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/e2e/brev-e2e.test.ts (1)
450-450: Quote the symlink target path for shell safetyThe current command is fine for standard
$HOME, but quoting the path makes this robust against whitespace/special chars in remote directory paths.Small robustness tweak
- `sudo ln -sf ${resolvedRemoteDir}/bin/nemoclaw.js /usr/local/bin/nemoclaw && sudo chmod +x ${resolvedRemoteDir}/bin/nemoclaw.js`, + `sudo ln -sf "${resolvedRemoteDir}/bin/nemoclaw.js" /usr/local/bin/nemoclaw && sudo chmod +x "${resolvedRemoteDir}/bin/nemoclaw.js"`,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/brev-e2e.test.ts` at line 450, The shell command constructing the symlink uses ${resolvedRemoteDir}/bin/nemoclaw.js unquoted which can break if resolvedRemoteDir contains spaces or special chars; update the string (the command literal that currently reads `sudo ln -sf ${resolvedRemoteDir}/bin/nemoclaw.js /usr/local/bin/nemoclaw && sudo chmod +x ${resolvedRemoteDir}/bin/nemoclaw.js`) to quote the target path where resolvedRemoteDir is used (e.g., wrap ${resolvedRemoteDir}/bin/nemoclaw.js in quotes) so both the ln -sf and chmod parts safely handle paths with spaces or special characters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/e2e-brev.yaml:
- Around line 233-246: Wrap each in-VM diagnostic command inside the ssh
multi-line string with a per-command timeout (using the timeout utility) so they
cannot hang; update the block that builds INSTANCE and runs ssh to prefix cp,
openshell sandbox list, openshell gateway status, docker ps -a and tar -C /tmp
-czf with e.g. "timeout 10s" or "timeout 30s" as appropriate (and fall back to
"|| true" to preserve existing behavior), ensuring the ssh command still uses
ConnectTimeout=10 but now also bounds each remote step so commands like
openshell, docker ps and tar cannot stall the job indefinitely.
---
Nitpick comments:
In `@test/e2e/brev-e2e.test.ts`:
- Line 450: The shell command constructing the symlink uses
${resolvedRemoteDir}/bin/nemoclaw.js unquoted which can break if
resolvedRemoteDir contains spaces or special chars; update the string (the
command literal that currently reads `sudo ln -sf
${resolvedRemoteDir}/bin/nemoclaw.js /usr/local/bin/nemoclaw && sudo chmod +x
${resolvedRemoteDir}/bin/nemoclaw.js`) to quote the target path where
resolvedRemoteDir is used (e.g., wrap ${resolvedRemoteDir}/bin/nemoclaw.js in
quotes) so both the ln -sf and chmod parts safely handle paths with spaces or
special characters.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9b4e79cb-8e83-44ea-b7b3-a5f29c563394
📒 Files selected for processing (3)
.github/workflows/e2e-brev.yamlscripts/brev-launchable-ci-cpu.shtest/e2e/brev-e2e.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/brev-launchable-ci-cpu.sh
|
✅ Brev E2E (credential-sanitization): PASSED on branch
|
|
✅ Validated end-to-end on Brev. Run 24746798821 — Ready for review. |
CodeRabbit review on #2196 flagged that the debug-bundle step could hang inside the SSH session if openshell/docker are in a bad state — precisely the case we need to survive since the step exists to diagnose pathological VM states. Wrap each service-touching command with `timeout`: - openshell sandbox list → 15s - openshell gateway status → 15s - docker ps -a → 15s - tar -czf → 30s Kept `cp` unwrapped (local-fs copy can't hang on a service). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
…A/NemoClaw into fix/brev-cpu-npm-link-hang
Summary
Reverts #2186 (which made Brev E2E failures 10× slower) and replaces `sudo npm link` with a direct `sudo ln -sf` symlink in both the launchable setup and the in-test bootstrap path. `npm link` is overkill for what we actually need and hangs indefinitely on cold CPU Brev instances. Direct symlink is O(1) and deterministic.
Related Issue
Regression from my own #2186. Surfaced while validating #2183 (ollama E2E).
Changes
This PR contains two commits:
Evidence
Type of Change
Verification
Retro note
Shipped #2186 without running the Brev suite end-to-end — exact failure mode I had flagged on #2123 earlier the same day. This PR is validated the right way before merge.
AI Disclosure
Summary by CodeRabbit
Chores
Tests