Skip to content

Revert #4045 crash-loop gateway PID detector#4056

Merged
ericksoa merged 1 commit into
mainfrom
revert-4045-crash-loop-pid-probe
May 22, 2026
Merged

Revert #4045 crash-loop gateway PID detector#4056
ericksoa merged 1 commit into
mainfrom
revert-4045-crash-loop-pid-probe

Conversation

@ericksoa
Copy link
Copy Markdown
Contributor

@ericksoa ericksoa commented May 22, 2026

Summary

Why

Validation

  • bash -n test/e2e/test-issue-2478-crash-loop-recovery.sh
  • git diff --check
  • git diff --stat 80ee341686d695147c5cd118d1049c32f52d5af9 -- test/e2e/test-issue-2478-crash-loop-recovery.sh is empty; the test file now matches the known-good pre-fix(openclaw): bump runtime deps EXDEV fix #3820 version.

Summary by CodeRabbit

  • Tests
    • Enhanced crash-loop recovery testing with simplified process detection and improved diagnostic capabilities for system troubleshooting.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 22, 2026

📝 Walkthrough

Walkthrough

The PR refactors two helper functions in the crash-loop-recovery E2E test script: gateway_pid() is simplified to use direct pgrep pattern matching instead of custom ps/awk extraction with log fallback, and gateway_diagnostics() switches from openshell sandbox get to openshell sandbox info --name for OpenShell container/pod diagnostics output.

Changes

E2E Gateway Diagnostics Helpers

Layer / File(s) Summary
Gateway process detection simplification
test/e2e/test-issue-2478-crash-loop-recovery.sh
gateway_pid() replaces custom ps/awk extraction and /tmp/gateway.log readiness fallback with a direct pgrep -fo '[o]penclaw[ -]gateway' pattern match returning the oldest process match.
OpenShell diagnostics command update
test/e2e/test-issue-2478-crash-loop-recovery.sh
gateway_diagnostics() adds a container/pod view label and switches the OpenShell section from openshell sandbox get to openshell sandbox info --name ... for improved diagnostics output.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4045: Modifies the same E2E helper functions in test-issue-2478-crash-loop-recovery.sh with related updates to gateway_pid() process detection and openshell sandbox diagnostics commands.

Suggested labels

Integration: OpenClaw, fix

Suggested reviewers

  • cv

Poem

🐰 A gateway searches swift and clean,
No logs required for what we've seen,
With pgrep we find what matters most,
And OpenShell speaks from post to post,
Diagnostics flow with newfound grace! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Revert #4045 crash-loop gateway PID detector' accurately reflects the main change in the PR, which is reverting a previous PID detection change.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch revert-4045-crash-loop-pid-probe

Comment @coderabbitai help to get the list of available commands and usage tips.

@ericksoa ericksoa added bug Something isn't working v0.0.49 Release target labels May 22, 2026
@github-actions
Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: issue-2478-crash-loop-recovery-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • None. No merge-blocking E2E is required because this is a tests-only change to an existing E2E script and cannot affect runtime or user-facing NemoClaw behavior. Running the touched job is useful as a non-blocking confidence check for the test change.

Optional E2E

  • issue-2478-crash-loop-recovery-e2e (high; live cloud onboarding plus crash/recovery soak, timeout 30 minutes): Optional validation of the modified E2E script itself. This job runs the touched test and confirms the updated PID matching, guard-log assertions, diagnostics command, and crash-loop recovery checks still work in CI.

New E2E recommendations

  • None.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/test-issue-2478-crash-loop-recovery.sh`:
- Line 113: The gateway_pid() helper currently returns the trimmed output of
sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" which may be non-numeric
stderr/garbage and causes wait_for_gateway_up() to treat it as a valid PID;
update gateway_pid() to validate the output is a positive integer (e.g., only
digits) before returning it and have it return empty/false on invalid output so
wait_for_gateway_up() only considers numeric PIDs, ensuring sandbox_exec/pgrep
stderr does not count as success.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5c700062-3aaa-4cb9-81d5-23f9f929c20f

📥 Commits

Reviewing files that changed from the base of the PR and between c84b6f1 and 2385e58.

📒 Files selected for processing (1)
  • test/e2e/test-issue-2478-crash-loop-recovery.sh

SH
)
sandbox_exec sh -c "$script" | awk '/^[0-9]+$/ { print; exit }'
sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate gateway_pid() output is numeric before reporting success.

At Line 113, stderr text from sandbox_exec/pgrep can become a non-empty string after whitespace stripping, and wait_for_gateway_up() will treat that as a valid PID.

Suggested fix
 gateway_pid() {
-  sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]'
+  local pid
+  pid="$(sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]')"
+  case "$pid" in
+    ''|*[!0-9]*) echo "" ;;
+    *) echo "$pid" ;;
+  esac
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]'
gateway_pid() {
local pid
pid="$(sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]')"
case "$pid" in
''|*[!0-9]*) echo "" ;;
*) echo "$pid" ;;
esac
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-issue-2478-crash-loop-recovery.sh` at line 113, The
gateway_pid() helper currently returns the trimmed output of sandbox_exec sh -c
"pgrep -fo '[o]penclaw[ -]gateway'" which may be non-numeric stderr/garbage and
causes wait_for_gateway_up() to treat it as a valid PID; update gateway_pid() to
validate the output is a positive integer (e.g., only digits) before returning
it and have it return empty/false on invalid output so wait_for_gateway_up()
only considers numeric PIDs, ensuring sandbox_exec/pgrep stderr does not count
as success.

@github-actions
Copy link
Copy Markdown
Contributor

PR Review Advisor

Recommendation: blocked
Confidence: high
Analyzed HEAD: 2385e58c620e37870a997a2a708c5f31986f08ce
Findings: 1 blocker(s), 2 warning(s), 0 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations: Review is based on the provided trusted metadata and diff; no PR scripts, E2E jobs, package-manager commands, or repository tests were executed.; PR title/body/comments and linked issue text were treated as untrusted evidence and used only for acceptance mapping.; The optional issue-2478-crash-loop-recovery-e2e job is not shown as passed for this head SHA.; The unresolved review thread content was available, but its final maintainer disposition is not resolved in the provided context.

Workflow run

Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: 2385e58c620e37870a997a2a708c5f31986f08ce
Recommendation: blocked
Confidence: high

Tests-only revert is low runtime risk, but the PR is currently blocked by mergeability/review-thread gates and the touched E2E PID detector can treat non-numeric stderr as a valid gateway PID.

Gate status

  • CI: pass — 5 required status context(s) completed with no failures for head SHA 2385e58. Non-required contexts still pending: 3; failed: 0.
  • Mergeability: fail — mergeStateStatus=BLOCKED; pull request mergeable_state=blocked.
  • Review threads: fail — 1 unresolved review thread(s), including CodeRabbit discussion on test/e2e/test-issue-2478-crash-loop-recovery.sh line 113.
  • Risky code tested: pass — No risky runtime code areas detected by path heuristics; only test/e2e/test-issue-2478-crash-loop-recovery.sh changed.

🔴 Blockers

  • Merge is blocked by repository gates: The PR cannot be treated as ready while mergeability is blocked and a review thread remains unresolved.
    • Recommendation: Resolve the outstanding review thread and re-check mergeability/status rollup for the same head SHA before merging.
    • Evidence: Trusted GitHub context reports mergeStateStatus=BLOCKED, reviewDecision=REVIEW_REQUIRED, and 1 unresolved review thread.

🟡 Warnings

  • gateway_pid can return non-numeric command output as a PID (test/e2e/test-issue-2478-crash-loop-recovery.sh:113): sandbox_exec redirects stderr to stdout. The new gateway_pid implementation strips whitespace from all output of pgrep, so stderr or other non-numeric text can become a non-empty string. wait_for_gateway_up only checks for non-empty output, which can make the E2E report a live gateway when no valid PID was found.
    • Recommendation: Validate that gateway_pid returns only digits before echoing it, or preserve the previous numeric awk filter behavior. Invalid or empty output should be treated as no PID.
    • Evidence: Diff changes gateway_pid from sandbox_exec ... | awk '/^[0-9]+$/ { print; exit }' to sandbox_exec sh -c "pgrep -fo '[o]penclaw[ -]gateway'" | tr -d '[:space:]'; CodeRabbit also opened an unresolved thread at line 113.
  • Active overlapping PRs touch the same crash-loop E2E file (test/e2e/test-issue-2478-crash-loop-recovery.sh): This revert overlaps active work on the same file and issue, so there is elevated drift risk around which OpenClaw process-title shape the E2E should support.

🔵 Suggestions

  • None.

Acceptance coverage

  • unknown — ASUS GX10 (NVIDIA DGX Spark, GB10 Grace Blackwell, 128GB unified memory): The PR changes only the E2E script; no evidence in the diff or provided checks that the job ran on ASUS GX10/DGX Spark hardware.
  • unknown — DGX OS 24.04: The touched test has Docker/OpenShell prerequisites but the provided context does not prove DGX OS 24.04 coverage for this head SHA.
  • unknown — NemoClaw 0.1.0: The PR modifies repository E2E script behavior, not a released NemoClaw 0.1.0 validation run.
  • partial — OpenClaw v2026.4.2 inside the sandbox: PR body says current sandbox is OpenClaw v2026.4.24 and the revert restores pre-fix(openclaw): bump runtime deps EXDEV fix #3820 matching; provided trusted context does not include a completed issue-2478 E2E run for this SHA.
  • unknown — Node 22.22.1 inside the sandbox: The diff does not alter Node runtime selection and the provided checks do not prove this runtime in E2E for the head SHA.
  • unknown — Balanced policy preset (default from the onboard wizard): The test onboards a sandbox normally, but the provided diff excerpt does not show an assertion of the Balanced policy preset.
  • partial — Local Ollama, model nemotron-3-super:120b: The test includes gateway_serves_inference with /v1/models, but no evidence confirms this exact local Ollama model in the current head run.
  • missing — One Telegram channel: The touched E2E verifies gateway recovery and inference API serving; the diff does not show a Telegram channel message round trip.
  • partial — Onboard finishes successfully. nemoclaw <name> status initially shows everything green. But the gateway never actually serves anything. Messages to the Telegram bot get no reply.: The E2E script onboards and includes gateway_serves_inference() to check https://inference.local/v1/models; it does not verify Telegram replies.
  • partial — Every time it boots, it crashes on the same line, then health-monitor restarts it, then it crashes again. Loops forever.: The test performs crash/recovery cycles and a soak assertion, but this PR changes only PID detection and no completed optional issue-2478 E2E result is provided for the head SHA.
  • partialconnect doesn't auto-recover, because the respawned gateway hits the same crash on the same line every time.: The E2E uses nemoclaw <name> connect --probe-only to exercise recovery. The current diff may weaken the PID oracle because non-numeric output can be accepted as success.
  • partial@homebridge/ciao (the mDNS/Bonjour library OpenClaw bundles for local network discovery) calls os.networkInterfaces() during init.: The test checks for ciao-network-guard preload markers and fallback log evidence, but this PR does not modify OpenClaw ciao behavior.
  • partial — Inside the OpenShell sandbox the underlying syscall fails with EPERM, because seccomp is blocking the netlink socket family.: The E2E is about OpenShell sandbox recovery under the ciao/network guard; the diff does not add a direct seccomp/netlink EPERM assertion.
  • missing — You can confirm the netlink restriction independently from inside the sandbox:: No ss -tlnp or equivalent netlink restriction assertion is added by this diff.
  • unknown — One thing worth flagging: ciao's mDNS isn't actually used by any of the supported channels. Telegram, Slack, and Discord all reach out over plain HTTPS.: The PR does not change or test channel mDNS usage.
  • partial — Override os.networkInterfaces before ciao loads, via a NODE_OPTIONS preload:: Existing test logic verifies proxy-env.sh contains safety-net and ciao-network-guard exports and gateway.log contains guard preload markers; this PR changes how the gateway PID is located for those checks.
  • unknown — This workaround can't be made persistent today. NemoClaw's config schema doesn't expose env vars or a preload path.: This PR is tests-only and does not modify config schema or persistence behavior.
  • unknown — 1. Wrap the ciao NetworkManager calls in try/catch inside OpenClaw, fall back to no mDNS if os.networkInterfaces() throws. Probably the smallest diff.: No OpenClaw runtime code is changed in this PR.
  • unknown — 2. Add an OPENCLAW_DISABLE_MDNS=1 env var (or a config flag) that skips loading ciao entirely. Most explicit user-facing fix.: No environment variable or config flag implementation is changed in this PR.
  • missing — 3. Loosen the OpenShell sandbox seccomp profile to allow the netlink syscall family. Probably not what you want for an isolation-focused product, but listing it for completeness.: No seccomp or sandbox policy is changed, which avoids introducing this security-sensitive relaxation.

Security review

  • pass — Category 1: Secrets and Credentials: No hardcoded secrets or credential material are introduced. The script continues to require NVIDIA_API_KEY from the environment and only validates its prefix.
  • warning — Category 2: Input Validation and Data Sanitization: The changed gateway_pid() no longer validates that command output is numeric. This is test-only, but untrusted stderr/stdout from sandbox_exec can be misinterpreted as a PID and invalidate recovery assertions.
  • pass — Category 3: Authentication and Authorization: Not applicable to runtime authz/authn; the PR changes only an E2E shell test helper and diagnostics command.
  • pass — Category 4: Dependencies and Third-Party Libraries: No dependency files, package versions, registries, installers, or third-party library additions are changed.
  • warning — Category 5: Error Handling and Logging: Test helper error handling is weaker because stderr from sandbox_exec/pgrep may be folded into PID output. Runtime logging behavior is not changed.
  • pass — Category 6: Cryptography and Data Protection: Not applicable — no cryptographic operations or data protection mechanisms are modified.
  • pass — Category 7: Configuration and Security Headers: No HTTP headers, CORS/CSP, Dockerfiles, seccomp profiles, sandbox policy, or configuration defaults are modified.
  • warning — Category 8: Security Testing: The changed file is a security/recovery-relevant E2E for sandbox crash-loop guard persistence. The current PID detector can false-positive on non-numeric output, reducing confidence in this security-adjacent test.
  • pass — Category 9: Holistic Security Posture: No runtime sandbox, SSRF, credential, blueprint, installer, workflow trusted-code boundary, or policy behavior is changed. The main risk is test oracle correctness, not runtime posture.

Test / E2E status

  • Test depth: unit_sufficient — Changes are limited to an existing E2E test script and cannot directly affect runtime behavior. However, because the changed helper is the oracle for the E2E itself, a lightweight shell/unit-style validation of numeric PID filtering would materially improve confidence.
  • E2E Advisor: ok

✅ What looks good

  • The PR is narrowly scoped to one E2E test file with no runtime code, dependency, installer, workflow, policy, or sandbox implementation changes.
  • Required status contexts completed successfully for the specified head SHA.
  • The E2E Advisor found no required E2E jobs for this tests-only change and identified issue-2478-crash-loop-recovery-e2e as optional.
  • The reverted helper removes the 2026.5.x-only fallback path after the context says Revert "fix(openclaw): bump runtime deps EXDEV fix (#3820)" #4051 reverted the runtime dependency change that motivated it.

Review completeness

  • Review is based on the provided trusted metadata and diff; no PR scripts, E2E jobs, package-manager commands, or repository tests were executed.
  • PR title/body/comments and linked issue text were treated as untrusted evidence and used only for acceptance mapping.
  • The optional issue-2478-crash-loop-recovery-e2e job is not shown as passed for this head SHA.
  • The unresolved review thread content was available, but its final maintainer disposition is not resolved in the provided context.
  • Human maintainer review required: yes

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26271698463
Target ref: revert-4045-crash-loop-pid-probe
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ✅ success

@ericksoa ericksoa merged commit ef84117 into main May 22, 2026
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working v0.0.49 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants