fix(cli): restore pre-#2398 gateway recovery (fixes E2E hangs) #2471

Merged: ericksoa merged 2 commits into main from revert/2398-dashboard-refactor on Apr 25, 2026
@ericksoa commented Apr 25, 2026

Summary

Root cause

PR #2398 replaced the simple gateway recovery path with recoverDashboardChain(), which introduced unbounded calls that hang in CI. A bisect confirmed #2398 as the sole culprit:

| Run | Commit | Result |
| --- | --- | --- |
| Last good | de97a00d (Apr 24 16:06) | pass |
| Bisect 4 | 9fbfbaca (#2398 only) | hang |
| First bad | 79c8e2a9 (Apr 25 00:10) | hang |

sandbox-survival, skip-permissions, sandbox-operations, and cloud-e2e all hang at nemoclaw <name> status after #2398.

What this reopens

Test plan

  • npx tsc -p tsconfig.src.json --noEmit passes
  • Dashboard unit tests pass
  • Nightly E2E: sandbox-survival, skip-permissions, sandbox-operations, cloud-e2e pass

Summary by CodeRabbit

  • Bug Fixes
    • Improved gateway health check reliability by using explicit status outputs instead of HTTP code interpretation.
    • Enhanced automatic recovery mechanism for sandbox gateway outages with better restart and port forwarding restoration.
    • Updated diagnostic logging to better reflect gateway restart and connectivity restoration activities.

Reverts the nemoclaw.ts changes from #2398 (dashboard delivery chain
refactor) which introduced hangs in `nemoclaw status` that cause
sandbox-survival, skip-permissions, and sandbox-operations E2E failures.

Restores the original implementations of:
- isSandboxGatewayRunning (probes via curl -sf, not http_code)
- recoverSandboxProcesses (direct SSH recovery, no CORS/download)
- ensureSandboxPortForward (simple forward stop/start)
- checkAndRecoverSandboxProcesses (uses the above, no dashboard chain)
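
To illustrate the first item, here is a minimal sketch of a `curl -sf` probe that reports an explicit RUNNING/STOPPED marker instead of an HTTP status code. It is not the code in src/nemoclaw.ts. The names `RunInSandbox`, `buildProbeCommand`, `parseProbeOutput`, and `isGatewayRunning`, and the health URL passed in, are all hypothetical. Only the `curl -sf` probe and the two markers come from this PR.

```typescript
// Hypothetical executor: runs a shell command inside the sandbox and returns stdout.
// In the real CLI this would be an SSH/exec call; it is injected here so the
// sketch stays self-contained.
type RunInSandbox = (command: string) => string;

// Build a shell one-liner that always prints exactly one marker.
// `-s` silences progress output; `-f` makes curl exit non-zero on HTTP >= 400.
// The URL is single-quoted, so it must not itself contain a single quote.
function buildProbeCommand(healthUrl: string): string {
  return `curl -sf '${healthUrl}' >/dev/null 2>&1 && echo RUNNING || echo STOPPED`;
}

// Interpret the probe output. Only an explicit RUNNING marker on the last
// non-empty line counts as healthy; empty output, STOPPED, or stray noise do not.
function parseProbeOutput(output: string): boolean {
  const lines = output
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return lines.length > 0 && lines[lines.length - 1] === "RUNNING";
}

function isGatewayRunning(run: RunInSandbox, healthUrl: string): boolean {
  try {
    return parseProbeOutput(run(buildProbeCommand(healthUrl)));
  } catch {
    // Assumption: a failed exec is treated the same as STOPPED.
    return false;
  }
}
```

Because the shell prints the marker, the caller never has to parse an HTTP code, and a probe that produces no output is read as "not running" rather than as an ambiguous status.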

The dashboard-contract, dashboard-health, and dashboard-recover modules
from #2398 are left in place since onboard.ts depends on them. Only the
nemoclaw.ts consumer (status/recovery path) is reverted.

Bisect evidence confirmed #2398 as sole culprit (run 24921496960).

coderabbitai Bot commented Apr 25, 2026

Walkthrough

The sandbox gateway health check mechanism is refactored to probe a health endpoint and interpret explicit RUNNING/STOPPED outputs instead of HTTP status codes. A new recovery function restarts the gateway using an agent-provided or fallback recovery script, and the recovery orchestration is updated to skip link-aware dashboard-chain recovery, restart processes, and re-establish port forwarding.
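
The "agent-provided or fallback recovery script" choice might look roughly like the sketch below. This is an assumption-laden illustration, not the function in src/nemoclaw.ts. `AgentConfig`, `buildFallbackScript`, and `pickRecoveryScript` are hypothetical names. The stale-file paths are passed in because this page does not give the real lock/log locations. Backgrounding with `nohup` is also an assumption. Only the `openclaw gateway run` launch command comes from the sequence diagram.

```typescript
// Hypothetical shape: the walkthrough only says the script is
// "agent-provided or fallback".
interface AgentConfig {
  recoveryScript?: string;
}

// Build a fallback script that removes stale lock/log files and relaunches
// the gateway. File paths are parameters because the real ones are not
// stated on this page.
function buildFallbackScript(staleFiles: string[]): string {
  const cleanup =
    staleFiles.length > 0
      ? `rm -f ${staleFiles.map((file) => `'${file}'`).join(" ")}; `
      : "";
  // Assumption: run in the background so the invoking session can return.
  return `${cleanup}nohup openclaw gateway run >/dev/null 2>&1 &`;
}

// Prefer a non-empty agent-provided script; otherwise use the fallback.
function pickRecoveryScript(agent: AgentConfig | undefined, staleFiles: string[]): string {
  const provided = agent?.recoveryScript?.trim();
  return provided ? provided : buildFallbackScript(staleFiles);
}
```

Treating a blank agent script the same as a missing one keeps recovery from silently running an empty command.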

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Sandbox Gateway Health & Recovery (src/nemoclaw.ts) | Refactored isSandboxGatewayRunning to use curl-based probing and explicit state outputs. Added recoverSandboxProcesses function with agent-provided or fallback recovery scripts. Updated checkAndRecoverSandboxProcesses to remove link-aware dashboard recovery, restart processes, re-validate gateway responsiveness, and re-establish port forwarding with updated log messages. |

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host Process
    participant Check as checkAndRecover<br/>SandboxProcesses
    participant Health as isSandboxGateway<br/>Running
    participant Recover as recoverSandbox<br/>Processes
    participant Sandbox as Sandbox<br/>Environment
    participant Gateway as Gateway HTTP<br/>Endpoint
    participant Forward as ensureSandbox<br/>PortForward

    Host->>Check: Initiate recovery check
    Check->>Health: Probe gateway health
    Health->>Gateway: curl -sf (health endpoint)
    Gateway-->>Health: Response (RUNNING/STOPPED)
    Health-->>Check: Gateway status
    
    alt Gateway is DOWN
        Check->>Recover: Trigger recovery
        Recover->>Sandbox: Execute recovery script<br/>(agent-provided or fallback)
        Sandbox->>Sandbox: Clean lock/log files
        Sandbox->>Sandbox: Launch: openclaw gateway run
        Recover-->>Check: Recovery complete
        
        Check->>Health: Re-validate gateway
        Health->>Gateway: curl -sf (health endpoint)
        Gateway-->>Health: RUNNING
        Health-->>Check: Gateway restored
        
        Check->>Forward: Re-establish port forward
        Forward->>Host: host → sandbox forward
        Forward-->>Check: Forward active
    else Gateway is RUNNING
        Check-->>Host: No recovery needed
    end
    
    Check->>Host: Emit completion log
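
The flow in the diagram above can be sketched as a small orchestration function. This is a simplified sketch, not the actual checkAndRecoverSandboxProcesses: `RecoveryDeps`, `RecoveryResult`, and `checkAndRecover` are hypothetical names, the dependencies are injected so the example stays self-contained, and it re-validates only once, whereas the real code may wait or retry.

```typescript
// Hypothetical dependency bundle; each field stands in for one participant
// in the diagram (health probe, process recovery, port forward, logging).
interface RecoveryDeps {
  isGatewayRunning: () => boolean;
  recoverProcesses: () => boolean; // true if the recovery script ran successfully
  ensurePortForward: () => void;
  log: (message: string) => void;
}

type RecoveryResult = "healthy" | "recovered" | "failed";

function checkAndRecover(deps: RecoveryDeps): RecoveryResult {
  // Gateway already answers: nothing to do.
  if (deps.isGatewayRunning()) return "healthy";

  deps.log("gateway not responding; restarting");
  if (!deps.recoverProcesses()) return "failed";

  // Re-validate before touching the forward, so a failed restart is
  // reported instead of being masked by a freshly created port forward.
  if (!deps.isGatewayRunning()) return "failed";

  deps.ensurePortForward();
  deps.log("gateway restarted and port forward restored");
  return "recovered";
}
```

Every step here is a single bounded call with no dashboard chain involved, which matches the intent of the revert described in this PR.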

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop hop, the gateway now stands tall,
With probes that echo through the hall,
Recovery scripts dance and play,
Ports forward in a brand new way,
Health checks true, no false alarms—
The sandbox thrives with open arms! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: restoring pre-#2398 gateway recovery to fix E2E hangs, which directly matches the PR's core objective. |
