Revert #4045 crash-loop gateway PID detector by ericksoa · Pull Request #4056 · NVIDIA/NemoClaw

ericksoa · 2026-05-22T06:13:12Z

Summary

Revert the crash-loop E2E PID-detector change from test: fix crash-loop gateway PID detection #4045.
test: fix crash-loop gateway PID detection #4045 adapted test-issue-2478-crash-loop-recovery.sh for the OpenClaw 2026.5.x process-title shape seen after fix(openclaw): bump runtime deps EXDEV fix #3820.
Revert "fix(openclaw): bump runtime deps EXDEV fix (#3820)" #4051 reverted fix(openclaw): bump runtime deps EXDEV fix #3820 and the latest sandbox is back on OpenClaw 2026.4.24, where the pre-fix(openclaw): bump runtime deps EXDEV fix #3820 detector is the known-good path.

Why

The known-good pre-fix(openclaw): bump runtime deps EXDEV fix #3820 run at 80ee341686d695147c5cd118d1049c32f52d5af9 passed issue-2478-crash-loop-recovery-e2e.
The current failing run on reverted main showed openclaw-gateway alive, [gateway] ready, sandbox Ready, and Agent: OpenClaw v2026.4.24, but the test: fix crash-loop gateway PID detection #4045 detector still returned empty.

Validation

bash -n test/e2e/test-issue-2478-crash-loop-recovery.sh
git diff --check
git diff --stat 80ee341686d695147c5cd118d1049c32f52d5af9 -- test/e2e/test-issue-2478-crash-loop-recovery.sh is empty; the test file now matches the known-good pre-fix(openclaw): bump runtime deps EXDEV fix #3820 version.

Summary by CodeRabbit

Tests
- Enhanced crash-loop recovery testing with simplified process detection and improved diagnostic capabilities for system troubleshooting.

coderabbitai · 2026-05-22T06:13:25Z

📝 Walkthrough

Walkthrough

The PR refactors two helper functions in the crash-loop-recovery E2E test script: gateway_pid() is simplified to use direct pgrep pattern matching instead of custom ps/awk extraction with log fallback, and gateway_diagnostics() switches from openshell sandbox get to openshell sandbox info --name for OpenShell container/pod diagnostics output.

Changes

E2E Gateway Diagnostics Helpers

Layer / File(s)	Summary
Gateway process detection simplification `test/e2e/test-issue-2478-crash-loop-recovery.sh`	`gateway_pid()` replaces custom `ps`/`awk` extraction and `/tmp/gateway.log` readiness fallback with a direct `pgrep -fo '[o]penclaw[ -]gateway'` pattern match returning the oldest process match.
OpenShell diagnostics command update `test/e2e/test-issue-2478-crash-loop-recovery.sh`	`gateway_diagnostics()` adds a container/pod view label and switches the OpenShell section from `openshell sandbox get` to `openshell sandbox info --name ...` for improved diagnostics output.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

NVIDIA/NemoClaw#4045: Modifies the same E2E helper functions in test-issue-2478-crash-loop-recovery.sh with related updates to gateway_pid() process detection and openshell sandbox diagnostics commands.

Suggested labels

Integration: OpenClaw, fix

Suggested reviewers

cv

Poem

🐰 A gateway searches swift and clean,
No logs required for what we've seen,
With pgrep we find what matters most,
And OpenShell speaks from post to post,
Diagnostics flow with newfound grace! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Revert `#4045` crash-loop gateway PID detector' accurately reflects the main change in the PR, which is reverting a previous PID detection change.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch revert-4045-crash-loop-pid-probe

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-22T06:14:20Z

E2E Advisor Recommendation

Required E2E: None
Optional E2E: issue-2478-crash-loop-recovery-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

None. No merge-blocking E2E is required because this is a tests-only change to an existing E2E script and cannot affect runtime or user-facing NemoClaw behavior. Running the touched job is useful as a non-blocking confidence check for the test change.

Optional E2E

issue-2478-crash-loop-recovery-e2e (high; live cloud onboarding plus crash/recovery soak, timeout 30 minutes): Optional validation of the modified E2E script itself. This job runs the touched test and confirms the updated PID matching, guard-log assertions, diagnostics command, and crash-loop recovery checks still work in CI.

New E2E recommendations

None.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/test-issue-2478-crash-loop-recovery.sh`:
- Line 113: The gateway_pid() helper currently returns the trimmed output of
sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" which may be non-numeric
stderr/garbage and causes wait_for_gateway_up() to treat it as a valid PID;
update gateway_pid() to validate the output is a positive integer (e.g., only
digits) before returning it and have it return empty/false on invalid output so
wait_for_gateway_up() only considers numeric PIDs, ensuring sandbox_exec/pgrep
stderr does not count as success.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5c700062-3aaa-4cb9-81d5-23f9f929c20f

📥 Commits

Reviewing files that changed from the base of the PR and between c84b6f1 and 2385e58.

📒 Files selected for processing (1)

test/e2e/test-issue-2478-crash-loop-recovery.sh

coderabbitai · 2026-05-22T06:15:25Z

-SH
-  )
-  sandbox_exec sh -c "$script" | awk '/^[0-9]+$/ { print; exit }'
+  sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]'


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate gateway_pid() output is numeric before reporting success.

At Line 113, stderr text from sandbox_exec/pgrep can become a non-empty string after whitespace stripping, and wait_for_gateway_up() will treat that as a valid PID.

Suggested fix

gateway_pid() { - sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]' + local pid + pid="$(sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]')" + case "$pid" in + ''|*[!0-9]*) echo "" ;; + *) echo "$pid" ;; + esac }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]'

gateway_pid() {

local pid

pid="$(sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]')"

case "$pid" in

''|*[!0-9]*) echo "" ;;

*) echo "$pid" ;;

esac

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-issue-2478-crash-loop-recovery.sh` at line 113, The gateway_pid() helper currently returns the trimmed output of sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" which may be non-numeric stderr/garbage and causes wait_for_gateway_up() to treat it as a valid PID; update gateway_pid() to validate the output is a positive integer (e.g., only digits) before returning it and have it return empty/false on invalid output so wait_for_gateway_up() only considers numeric PIDs, ensuring sandbox_exec/pgrep stderr does not count as success.

github-actions · 2026-05-22T06:20:11Z

PR Review Advisor

Recommendation: blocked
Confidence: high
Analyzed HEAD: 2385e58c620e37870a997a2a708c5f31986f08ce
Findings: 1 blocker(s), 2 warning(s), 0 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations: Review is based on the provided trusted metadata and diff; no PR scripts, E2E jobs, package-manager commands, or repository tests were executed.; PR title/body/comments and linked issue text were treated as untrusted evidence and used only for acceptance mapping.; The optional issue-2478-crash-loop-recovery-e2e job is not shown as passed for this head SHA.; The unresolved review thread content was available, but its final maintainer disposition is not resolved in the provided context.

Workflow run

Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: 2385e58c620e37870a997a2a708c5f31986f08ce
Recommendation: blocked
Confidence: high

Tests-only revert is low runtime risk, but the PR is currently blocked by mergeability/review-thread gates and the touched E2E PID detector can treat non-numeric stderr as a valid gateway PID.

Gate status

CI: pass — 5 required status context(s) completed with no failures for head SHA 2385e58. Non-required contexts still pending: 3; failed: 0.
Mergeability: fail — mergeStateStatus=BLOCKED; pull request mergeable_state=blocked.
Review threads: fail — 1 unresolved review thread(s), including CodeRabbit discussion on test/e2e/test-issue-2478-crash-loop-recovery.sh line 113.
Risky code tested: pass — No risky runtime code areas detected by path heuristics; only test/e2e/test-issue-2478-crash-loop-recovery.sh changed.

🔴 Blockers

Merge is blocked by repository gates: The PR cannot be treated as ready while mergeability is blocked and a review thread remains unresolved.
- Recommendation: Resolve the outstanding review thread and re-check mergeability/status rollup for the same head SHA before merging.
- Evidence: Trusted GitHub context reports mergeStateStatus=BLOCKED, reviewDecision=REVIEW_REQUIRED, and 1 unresolved review thread.

🟡 Warnings

gateway_pid can return non-numeric command output as a PID (test/e2e/test-issue-2478-crash-loop-recovery.sh:113): sandbox_exec redirects stderr to stdout. The new gateway_pid implementation strips whitespace from all output of pgrep, so stderr or other non-numeric text can become a non-empty string. wait_for_gateway_up only checks for non-empty output, which can make the E2E report a live gateway when no valid PID was found.
- Recommendation: Validate that gateway_pid returns only digits before echoing it, or preserve the previous numeric awk filter behavior. Invalid or empty output should be treated as no PID.
- Evidence: Diff changes gateway_pid from sandbox_exec ... | awk '/^[0-9]+$/ { print; exit }' to sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]'; CodeRabbit also opened an unresolved thread at line 113.
Active overlapping PRs touch the same crash-loop E2E file (test/e2e/test-issue-2478-crash-loop-recovery.sh): This revert overlaps active work on the same file and issue, so there is elevated drift risk around which OpenClaw process-title shape the E2E should support.
- Recommendation: Coordinate with overlapping PRs before merge, especially PR test: match OpenClaw argv in crash-loop PID probe #4049 and PR chore: upgrade agent runtime dependencies #3925, to avoid reverting or duplicating active PID-probe work.
- Evidence: Trusted drift context lists openPrOverlaps: test: match OpenClaw argv in crash-loop PID probe #4049 test: match OpenClaw argv in crash-loop PID probe linked to [DGX Spark] Gateway crash loop on startup: @homebridge/ciao networkInterfaces() returns EPERM in OpenShell sandbox #2478 and chore: upgrade agent runtime dependencies #3925 chore: upgrade agent runtime dependencies, both touching test/e2e/test-issue-2478-crash-loop-recovery.sh.

🔵 Suggestions

None.

Acceptance coverage

unknown — ASUS GX10 (NVIDIA DGX Spark, GB10 Grace Blackwell, 128GB unified memory): The PR changes only the E2E script; no evidence in the diff or provided checks that the job ran on ASUS GX10/DGX Spark hardware.
unknown — DGX OS 24.04: The touched test has Docker/OpenShell prerequisites but the provided context does not prove DGX OS 24.04 coverage for this head SHA.
unknown — NemoClaw 0.1.0: The PR modifies repository E2E script behavior, not a released NemoClaw 0.1.0 validation run.
partial — OpenClaw v2026.4.2 inside the sandbox: PR body says current sandbox is OpenClaw v2026.4.24 and the revert restores pre-fix(openclaw): bump runtime deps EXDEV fix #3820 matching; provided trusted context does not include a completed issue-2478 E2E run for this SHA.
unknown — Node 22.22.1 inside the sandbox: The diff does not alter Node runtime selection and the provided checks do not prove this runtime in E2E for the head SHA.
unknown — Balanced policy preset (default from the onboard wizard): The test onboards a sandbox normally, but the provided diff excerpt does not show an assertion of the Balanced policy preset.
partial — Local Ollama, model nemotron-3-super:120b: The test includes gateway_serves_inference with /v1/models, but no evidence confirms this exact local Ollama model in the current head run.
missing — One Telegram channel: The touched E2E verifies gateway recovery and inference API serving; the diff does not show a Telegram channel message round trip.
partial — Onboard finishes successfully. nemoclaw <name> status initially shows everything green. But the gateway never actually serves anything. Messages to the Telegram bot get no reply.: The E2E script onboards and includes gateway_serves_inference() to check https://inference.local/v1/models; it does not verify Telegram replies.
partial — Every time it boots, it crashes on the same line, then health-monitor restarts it, then it crashes again. Loops forever.: The test performs crash/recovery cycles and a soak assertion, but this PR changes only PID detection and no completed optional issue-2478 E2E result is provided for the head SHA.
partial — connect doesn't auto-recover, because the respawned gateway hits the same crash on the same line every time.: The E2E uses nemoclaw <name> connect --probe-only to exercise recovery. The current diff may weaken the PID oracle because non-numeric output can be accepted as success.
partial — @homebridge/ciao (the mDNS/Bonjour library OpenClaw bundles for local network discovery) calls os.networkInterfaces() during init.: The test checks for ciao-network-guard preload markers and fallback log evidence, but this PR does not modify OpenClaw ciao behavior.
partial — Inside the OpenShell sandbox the underlying syscall fails with EPERM, because seccomp is blocking the netlink socket family.: The E2E is about OpenShell sandbox recovery under the ciao/network guard; the diff does not add a direct seccomp/netlink EPERM assertion.
missing — You can confirm the netlink restriction independently from inside the sandbox:: No ss -tlnp or equivalent netlink restriction assertion is added by this diff.
unknown — One thing worth flagging: ciao's mDNS isn't actually used by any of the supported channels. Telegram, Slack, and Discord all reach out over plain HTTPS.: The PR does not change or test channel mDNS usage.
partial — Override os.networkInterfaces before ciao loads, via a NODE_OPTIONS preload:: Existing test logic verifies proxy-env.sh contains safety-net and ciao-network-guard exports and gateway.log contains guard preload markers; this PR changes how the gateway PID is located for those checks.
unknown — This workaround can't be made persistent today. NemoClaw's config schema doesn't expose env vars or a preload path.: This PR is tests-only and does not modify config schema or persistence behavior.
unknown — 1. Wrap the ciao NetworkManager calls in try/catch inside OpenClaw, fall back to no mDNS if os.networkInterfaces() throws. Probably the smallest diff.: No OpenClaw runtime code is changed in this PR.
unknown — 2. Add an OPENCLAW_DISABLE_MDNS=1 env var (or a config flag) that skips loading ciao entirely. Most explicit user-facing fix.: No environment variable or config flag implementation is changed in this PR.
missing — 3. Loosen the OpenShell sandbox seccomp profile to allow the netlink syscall family. Probably not what you want for an isolation-focused product, but listing it for completeness.: No seccomp or sandbox policy is changed, which avoids introducing this security-sensitive relaxation.

Security review

pass — Category 1: Secrets and Credentials: No hardcoded secrets or credential material are introduced. The script continues to require NVIDIA_API_KEY from the environment and only validates its prefix.
warning — Category 2: Input Validation and Data Sanitization: The changed gateway_pid() no longer validates that command output is numeric. This is test-only, but untrusted stderr/stdout from sandbox_exec can be misinterpreted as a PID and invalidate recovery assertions.
pass — Category 3: Authentication and Authorization: Not applicable to runtime authz/authn; the PR changes only an E2E shell test helper and diagnostics command.
pass — Category 4: Dependencies and Third-Party Libraries: No dependency files, package versions, registries, installers, or third-party library additions are changed.
warning — Category 5: Error Handling and Logging: Test helper error handling is weaker because stderr from sandbox_exec/pgrep may be folded into PID output. Runtime logging behavior is not changed.
pass — Category 6: Cryptography and Data Protection: Not applicable — no cryptographic operations or data protection mechanisms are modified.
pass — Category 7: Configuration and Security Headers: No HTTP headers, CORS/CSP, Dockerfiles, seccomp profiles, sandbox policy, or configuration defaults are modified.
warning — Category 8: Security Testing: The changed file is a security/recovery-relevant E2E for sandbox crash-loop guard persistence. The current PID detector can false-positive on non-numeric output, reducing confidence in this security-adjacent test.
pass — Category 9: Holistic Security Posture: No runtime sandbox, SSRF, credential, blueprint, installer, workflow trusted-code boundary, or policy behavior is changed. The main risk is test oracle correctness, not runtime posture.

Test / E2E status

Test depth: unit_sufficient — Changes are limited to an existing E2E test script and cannot directly affect runtime behavior. However, because the changed helper is the oracle for the E2E itself, a lightweight shell/unit-style validation of numeric PID filtering would materially improve confidence.
E2E Advisor: ok

✅ What looks good

The PR is narrowly scoped to one E2E test file with no runtime code, dependency, installer, workflow, policy, or sandbox implementation changes.
Required status contexts completed successfully for the specified head SHA.
The E2E Advisor found no required E2E jobs for this tests-only change and identified issue-2478-crash-loop-recovery-e2e as optional.
The reverted helper removes the 2026.5.x-only fallback path after the context says Revert "fix(openclaw): bump runtime deps EXDEV fix (#3820)" #4051 reverted the runtime dependency change that motivated it.

Review completeness

Review is based on the provided trusted metadata and diff; no PR scripts, E2E jobs, package-manager commands, or repository tests were executed.
PR title/body/comments and linked issue text were treated as untrusted evidence and used only for acceptance mapping.
The optional issue-2478-crash-loop-recovery-e2e job is not shown as passed for this head SHA.
The unresolved review thread content was available, but its final maintainer disposition is not resolved in the provided context.
Human maintainer review required: yes

github-actions · 2026-05-22T06:28:13Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26271698463
Target ref: revert-4045-crash-loop-pid-probe
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job	Result
issue-2478-crash-loop-recovery-e2e	✅ success

Revert "test: fix crash-loop gateway PID detection"

2385e58

ericksoa added bug Something isn't working v0.0.49 Release target labels May 22, 2026

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

ericksoa merged commit ef84117 into main May 22, 2026
30 checks passed

-  sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]'
+gateway_pid() {
+  local pid
+  pid="$(sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]')"
+  case "$pid" in
+    ''|*[!0-9]*) echo "" ;;
+    *) echo "$pid" ;;
+  esac
+}

Conversation

ericksoa commented May 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Validation

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

github-actions Bot commented May 22, 2026

E2E Advisor Recommendation

E2E Recommendation Advisor

Required E2E

Optional E2E

New E2E recommendations

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 22, 2026

PR Review Advisor

PR Review Advisor

Gate status

🔴 Blockers

🟡 Warnings

🔵 Suggestions

Acceptance coverage

Security review

Test / E2E status

✅ What looks good

Review completeness

Uh oh!

github-actions Bot commented May 22, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ericksoa commented May 22, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 22, 2026 •

edited

Loading