Skip to content

fix(onboard): recover from SSH 255 when sandbox was already created#1598

Merged
ericksoa merged 1 commit intomainfrom
fix/sandbox-create-ssh255-recovery
Apr 8, 2026
Merged

fix(onboard): recover from SSH 255 when sandbox was already created#1598
ericksoa merged 1 commit intomainfrom
fix/sandbox-create-ssh255-recovery

Conversation

@ericksoa
Copy link
Copy Markdown
Contributor

@ericksoa ericksoa commented Apr 8, 2026

Summary

  • When openshell sandbox create exits with SSH 255 after printing "Created sandbox:", NemoClaw now treats this as recoverable instead of hard-exiting
  • Added a final readyCheck in streamSandboxCreate on non-zero close to catch the race where the sandbox is already Ready when SSH dies
  • In onboard.js, sandbox_create_incomplete failures now fall through to the existing 60s ready-wait loop instead of calling process.exit()

Root cause

Regression from #1516 (create-stream extraction) combined with #1081/#1527 (messaging provider migration forcing sandbox recreation). The create stream returns non-zero after "Created sandbox:" and NemoClaw exits before checking if the sandbox reached Ready state.

Test plan

  • New unit tests: stream recovers when readyCheck is true at close time (SSH 255)
  • New unit test: stream still returns non-zero when readyCheck is false at close time
  • All existing sandbox-create-stream tests pass (7/7)
  • All onboard integration tests pass (83/83)
  • All src unit tests pass (538/538)
  • Manual: trigger SSH 255 during sandbox create and verify recovery

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced sandbox creation error handling with improved failure classification and recovery messaging.
    • Improved sandbox readiness detection to prevent unnecessary creation failures when the sandbox is operational.

When openshell sandbox create exits with SSH 255 after printing
"Created sandbox:", NemoClaw previously hard-exited instead of checking
whether the sandbox reached Ready state. This was a regression from
the create-stream extraction (#1516) combined with the messaging
provider migration path (#1081, #1527) that forces sandbox recreation.

Two fixes:
- streamSandboxCreate: do one final readyCheck on non-zero close to
  catch the race where the sandbox is already Ready when SSH dies.
- onboard.js: when failure is sandbox_create_incomplete, fall through
  to the existing ready-wait loop (60s polling) instead of exiting.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa92ed04-1fd8-40d9-81a6-e30f370b5bcc

📥 Commits

Reviewing files that changed from the base of the PR and between 4c38bfa and 48b39fb.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • src/lib/sandbox-create-stream.test.ts
  • src/lib/sandbox-create-stream.ts

📝 Walkthrough

Walkthrough

This PR enhances sandbox creation error handling in two places: the onboard orchestrator now classifies creation failures and conditionally continues or exits, while the create-stream module adds a final readiness check when the child process exits with a non-zero code to override the error if the sandbox is actually ready.

Changes

Cohort / File(s) Summary
Sandbox Creation Error Recovery
bin/lib/onboard.js, src/lib/sandbox-create-stream.ts
Enhanced error handling with failure classification and readiness checks. Onboard now classifies creation failures and continues on sandbox_create_incomplete. Create-stream adds a synchronous final readiness check on non-zero child process exit to override errors when sandbox is already functional.
Creation Stream Tests
src/lib/sandbox-create-stream.test.ts
Added two test cases covering child process close behavior: one where readiness check succeeds before close (overrides non-zero exit), another where readiness check fails (preserves non-zero exit status).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A sandbox born, but stumbling fast,
Yet readiness checked—it's here at last!
When child processes exit with a frown,
We check one more time before we're down.
Recovery blooms where errors stood before! 🌱

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly references the main fix: recovering from SSH exit code 255 when a sandbox was already created, which is the core objective of this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/sandbox-create-ssh255-recovery

Comment @coderabbitai help to get the list of available commands and usage tips.

@ericksoa ericksoa merged commit adbea05 into main Apr 8, 2026
9 checks passed
@ericksoa ericksoa deleted the fix/sandbox-create-ssh255-recovery branch April 8, 2026 03:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants