fix(sandbox): fix non-root gateway startup and add crash safety net#2472
fix(sandbox): fix non-root gateway startup and add crash safety net#2472
Conversation
…og readable Two fixes for the messaging-providers-e2e Phase 7 Slack guard test that has never passed since #2355: 1. Add the Slack channel guard to the proxy-env.sh sourced file so interactive sessions (openshell sandbox connect/exec) see the guard in NODE_OPTIONS. The guard file is installed after proxy-env.sh is written, so use a runtime conditional ([ -f ... ]) in the sourced script. This fixes the misleading diagnostic that showed NODE_OPTIONS without the guard. 2. Change gateway.log permissions from 600 to 644 so E2E diagnostics (openshell sandbox exec -- cat /tmp/gateway.log) can read the log without being the gateway user. The log doesn't contain secrets.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a global Node.js sandbox safety-net preload and a ciao/mDNS guard preload; reorders NODE_OPTIONS (safety-net first, ciao always, Slack guard conditional); makes rc-file locking best-effort; permits Changes
Sequence Diagram(s)sequenceDiagram
participant Shell as User Shell
participant Start as scripts/nemoclaw-start.sh
participant Node as Node.js Process
participant Safety as Safety-net Preload
participant Ciao as Ciao/mDNS Guard
participant App as Application
Note over Start,Node: Build NODE_OPTIONS (safety-net first,\nciao guard always, slack guard if present)
Shell->>Start: launch sandbox (OPENSHELL_SANDBOX=1)
Start->>Node: exec node with NODE_OPTIONS (preloads)
Node->>Safety: require safety-net preload
Safety->>Node: install uncaughtException / unhandledRejection handlers
Node->>Ciao: require ciao guard preload
Ciao->>Node: monkey-patch os.networkInterfaces() to safe-return {}
Node->>App: load and run application
App-->>Safety: runtime error / unhandled rejection
Safety-->>App: swallow/log error and optionally prevent exit
App->>Node: normal or recovered termination
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 1399-1404: Change the permission of /tmp/gateway.log to match the
validator expectations: replace the chmod 644 call with chmod 600 so the file
created in nemoclaw-start.sh remains owned by gateway:gateway and is only
readable by owner; ensure this aligns with validate_tmp_permissions (the sandbox
tmp-permissions validator) and keep the existing touch and chown/gateway
ownership calls intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 22e388c4-ba43-4f37-8cb1-e2b31e8434dd
📒 Files selected for processing (1)
scripts/nemoclaw-start.sh
The guard file doesn't exist in the sandbox even though openclaw.json should contain "slack". Add logging to install_slack_channel_guard when the grep fails (reports file existence/readability) and add E2E diagnostics to check the grep result and container logs for guard skip/install messages.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
scripts/nemoclaw-start.sh (1)
1403-1408:⚠️ Potential issue | 🔴 Critical
chmod 644still conflicts with the tmp-permissions validator.The root path still runs
validate_tmp_permissionsbefore launch, so leaving Line 1408 at644will fail startup if that validator still requires/tmp/gateway.logto stay owner-only. Either keep this file at600, or update the validator and every dependent expectation together.You can verify the mismatch with:
#!/usr/bin/env bash set -euo pipefail echo "== gateway.log permissions in startup scripts ==" rg -n -C2 '/tmp/gateway\.log|chmod 644|chmod 600' \ scripts/nemoclaw-start.sh \ scripts/lib/sandbox-init.sh \ agents/hermes/start.sh echo echo "== validate_tmp_permissions implementation ==" rg -n -A60 -B5 'validate_tmp_permissions\s*\(\)' scripts/lib/sandbox-init.shExpected result:
scripts/nemoclaw-start.shshowschmod 644, whilescripts/lib/sandbox-init.shstill documents or enforces600for/tmp/gateway.log.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 1403 - 1408, The startup script creates /tmp/gateway.log with chmod 644 which conflicts with the existing validate_tmp_permissions logic (validate_tmp_permissions in scripts/lib/sandbox-init.sh) that expects owner-only perms; change the chmod in the block where /tmp/gateway.log is touched/chowned in scripts/nemoclaw-start.sh from 644 to 600 to match the validator (or alternatively, if you intend to relax the validator, update validate_tmp_permissions and all dependent expectations (including agents/hermes/start.sh and any tests) together), ensuring the file remains owned by gateway:gateway and permission checks in validate_tmp_permissions still pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-messaging-providers.sh`:
- Around line 675-676: Replace the grep+head pipeline with grep's match-limit
option: in the command that sets the container_log variable (the line using
nemoclaw "$SANDBOX_NAME" logs ... | grep -i "channel
guard\|slack.*guard\|guard.*skip\|guard.*install" | head -5 || echo "no guard
messages"), remove the pipe to head and add grep -m 5 to limit matches; preserve
the case-insensitive -i and the trailing || echo fallback so container_log still
falls back to "no guard messages" when there are no matches.
---
Duplicate comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 1403-1408: The startup script creates /tmp/gateway.log with chmod
644 which conflicts with the existing validate_tmp_permissions logic
(validate_tmp_permissions in scripts/lib/sandbox-init.sh) that expects
owner-only perms; change the chmod in the block where /tmp/gateway.log is
touched/chowned in scripts/nemoclaw-start.sh from 644 to 600 to match the
validator (or alternatively, if you intend to relax the validator, update
validate_tmp_permissions and all dependent expectations (including
agents/hermes/start.sh and any tests) together), ensuring the file remains owned
by gateway:gateway and permission checks in validate_tmp_permissions still pass.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 93f07b20-4343-495f-8a3d-7cbbac4c1d43
📒 Files selected for processing (2)
scripts/nemoclaw-start.shtest/e2e/test-messaging-providers.sh
nemoclaw logs reads /tmp/gateway.log, not container stderr. The entrypoint guard messages go to stderr (Docker logs). Try openshell sandbox logs and docker logs directly to find guard installation messages.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
test/e2e/test-messaging-providers.sh (1)
675-680:⚠️ Potential issue | 🟡 MinorAvoid false “no guard messages” fallbacks under
pipefail.With
set -o pipefail(Line 58),grep ... | head -10can return non-zero (SIGPIPE ongrep) after matching lines, which incorrectly triggers the|| echo ...fallback. This can hide real guard diagnostics.Suggested fix
- container_log=$(openshell sandbox logs --name "$SANDBOX_NAME" 2>&1 | grep -i "channel guard\|slack.*guard\|guard.*skip\|guard.*install\|\[channels\].*slack\|\[channels\].*guard" | head -10 || echo "no guard messages in openshell logs") + container_log=$(openshell sandbox logs --name "$SANDBOX_NAME" 2>&1 | grep -im 10 "channel guard\|slack.*guard\|guard.*skip\|guard.*install\|\[channels\].*slack\|\[channels\].*guard" || echo "no guard messages in openshell logs") @@ - docker_log=$(docker logs "$container_id" 2>&1 | grep -i "channel guard\|slack.*guard\|\[channels\]" | head -10 || echo "no guard messages in docker logs") + docker_log=$(docker logs "$container_id" 2>&1 | grep -im 10 "channel guard\|slack.*guard\|\[channels\]" || echo "no guard messages in docker logs")#!/usr/bin/env bash set -uo pipefail echo "Repro: grep|head can trigger fallback even when matches exist" out=$( printf 'match\n%.0s' {1..50} \ | grep -i 'match' \ | head -10 \ || echo 'FALLBACK_TRIGGERED' ) printf '%s\n' "$out" echo echo "Control: grep -m avoids SIGPIPE fallback" out2=$( printf 'match\n%.0s' {1..50} \ | grep -im 10 'match' \ || echo 'FALLBACK_TRIGGERED' ) printf '%s\n' "$out2"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-messaging-providers.sh` around lines 675 - 680, The fallback is being incorrectly triggered due to SIGPIPE when using "grep ... | head -10" under pipefail; update the two places that set container_log and docker_log (the assignments referencing openshell logs and docker logs) to avoid a piped head: replace the grep|head pipeline with grep -m 10 (use the -m/--max-count option) so grep stops after 10 matches and won't emit SIGPIPE, ensuring the "|| echo 'no guard messages...'" fallback only runs when there truly are no matches.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/e2e/test-messaging-providers.sh`:
- Around line 675-680: The fallback is being incorrectly triggered due to
SIGPIPE when using "grep ... | head -10" under pipefail; update the two places
that set container_log and docker_log (the assignments referencing openshell
logs and docker logs) to avoid a piped head: replace the grep|head pipeline with
grep -m 10 (use the -m/--max-count option) so grep stops after 10 matches and
won't emit SIGPIPE, ensuring the "|| echo 'no guard messages...'" fallback only
runs when there truly are no matches.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2dfb655a-0981-42e1-a90b-d1826e713a26
📒 Files selected for processing (1)
test/e2e/test-messaging-providers.sh
List all nemoclaw-* and gateway.log files in /tmp to see exactly what the entrypoint created vs what's missing.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
test/e2e/test-messaging-providers.sh (1)
678-684:⚠️ Potential issue | 🟡 MinorAvoid
grep | headwithpipefailin these log captures.Line 678 and Line 683 can emit fallback text even when matches exist (SIGPIPE from
headmakes the pipeline fail underpipefail), which pollutes diagnostics. This was already raised previously in the PR discussion.Suggested fix
- container_log=$(openshell sandbox logs --name "$SANDBOX_NAME" 2>&1 | grep -i "channel guard\|slack.*guard\|guard.*skip\|guard.*install\|\[channels\].*slack\|\[channels\].*guard" | head -10 || echo "no guard messages in openshell logs") + container_log=$(openshell sandbox logs --name "$SANDBOX_NAME" 2>&1 | grep -im 10 "channel guard\|slack.*guard\|guard.*skip\|guard.*install\|\[channels\].*slack\|\[channels\].*guard" || echo "no guard messages in openshell logs") @@ - container_id=$(openshell sandbox exec --name "$SANDBOX_NAME" -- cat /proc/1/cgroup 2>/dev/null | grep -oP '[a-f0-9]{64}' | head -1 || echo "") + container_id=$(openshell sandbox exec --name "$SANDBOX_NAME" -- cat /proc/1/cgroup 2>/dev/null | grep -oPm 1 '[a-f0-9]{64}' || echo "") @@ - docker_log=$(docker logs "$container_id" 2>&1 | grep -i "channel guard\|slack.*guard\|\[channels\]" | head -10 || echo "no guard messages in docker logs") + docker_log=$(docker logs "$container_id" 2>&1 | grep -im 10 "channel guard\|slack.*guard\|\[channels\]" || echo "no guard messages in docker logs")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-messaging-providers.sh` around lines 678 - 684, The pipelines that capture logs (the commands assigning container_log and docker_log and the container_id retrieval) use "grep ... | head -10" which can produce SIGPIPE under pipefail and return the fallback text; change these to avoid piping into head (e.g., use grep's -m 10 to limit matches or use a single-tool solution like awk/sed to select the first 10 matches) so the pipeline won't fail with SIGPIPE. Update the two places that create container_log and docker_log (and any similar openshell/docker log captures) to use grep -m 10 or an equivalent single-process selector to reliably return up to 10 matches without causing a pipe failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-messaging-providers.sh`:
- Around line 665-667: The current tmp_files assignment calls openshell sandbox
exec with an unquoted glob (/tmp/nemoclaw-*) so the caller shell may expand it;
change the invocation of openshell sandbox exec (the command that sets
tmp_files) to run a shell inside the sandbox (e.g., use sh -c with a
single-quoted command) so ls -la /tmp/nemoclaw-* /tmp/gateway.log is executed
and expanded inside the sandbox; update the tmp_files variable assignment and
keep the surrounding error fallback and the info " /tmp/nemoclaw-* files:
$tmp_files" unchanged.
---
Duplicate comments:
In `@test/e2e/test-messaging-providers.sh`:
- Around line 678-684: The pipelines that capture logs (the commands assigning
container_log and docker_log and the container_id retrieval) use "grep ... |
head -10" which can produce SIGPIPE under pipefail and return the fallback text;
change these to avoid piping into head (e.g., use grep's -m 10 to limit matches
or use a single-tool solution like awk/sed to select the first 10 matches) so
the pipeline won't fail with SIGPIPE. Update the two places that create
container_log and docker_log (and any similar openshell/docker log captures) to
use grep -m 10 or an equivalent single-process selector to reliably return up to
10 matches without causing a pipe failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 35e4ab05-ccaa-495b-af0f-d820d3edfff1
📒 Files selected for processing (1)
test/e2e/test-messaging-providers.sh
The /tmp/nemoclaw-* glob was expanding on the host shell before being passed to openshell sandbox exec, showing host files instead of sandbox files. Wrap in bash -c to expand inside the container.
Write breadcrumb timestamps to /tmp/nemoclaw-entrypoint-trace.log at key points: after proxy-env, before root/non-root branch, before each guard install call. Read the trace in the E2E diagnostic. This will show exactly where the entrypoint stops executing.
Root cause: install_configure_guard() tries to write to /sandbox/.bashrc which is Landlock read-only at runtime (#804). With set -e active, the write failure kills the entrypoint before install_slack_channel_guard and the gateway startup ever run. The proxy fix and nemotron fix work because they're installed at top level (before the root/non-root branch). The Slack guard and gateway startup are inside the branch and never execute. Fix: check file writability before attempting the .bashrc update. If the file is read-only (Landlock), skip it gracefully. Also add 2>/dev/null || true to the cat redirect as defense-in-depth.
lock_rc_files() calls chmod 444 on .bashrc/.profile which fails under Landlock. With set -e this kills the entrypoint — same root cause as the install_configure_guard fix. Add || true so it degrades gracefully.
gateway.log was changed from 600 to 644 for diagnostic readability. Update the validate_tmp_permissions check to expect 644 for gateway.log so it doesn't fail and kill the entrypoint under set -e.
The trace still dies before install_configure_guard. Add per-line traces to identify which of verify_config_integrity, apply_model_override, apply_cors_override, apply_slack_token_override, or token generation is the actual failure point.
The [ -w file ] test checks DAC permissions but cannot detect Landlock enforcement. The sandbox user owns .bashrc (DAC says writable) but Landlock blocks the write at kernel level. Under set -e, the failed write kills the entrypoint before the gateway ever starts. Remove the -w guard entirely and wrap every write operation in || true / continue so Landlock failures are silently skipped.
- Change non-root gateway.log to 644 (matching root path) - Add post-launch diagnostic: check if gateway PID is alive after 3s, dump gateway.log contents to trace file if non-empty
The @homebridge/ciao mDNS library calls os.networkInterfaces() which
throws SystemError (uv_interface_addresses) inside sandboxes with
restricted network namespaces. This crashes the gateway even though
mDNS is not needed for NemoClaw operation.
Add a NODE_OPTIONS preload that:
1. Monkey-patches os.networkInterfaces to return {} on failure
2. Catches the uncaughtException as a fallback for any call sites
that bypass the monkey-patch
Installed unconditionally at top level (same pattern as proxy fix
and nemotron fix) since any sandbox can hit this.
The OpenClaw gateway health monitor kills the entire gateway process when a messaging channel fails to connect within 120s (the channel-connect-grace). With fake/placeholder Slack tokens, the Slack channel auth always fails, and the health monitor kills the gateway after the grace period — even though the Slack guard successfully caught the initial auth error. Set gateway.channelHealthCheckMinutes to 0 in the baked openclaw.json config, which disables the health monitor entirely. In a NemoClaw sandbox, channel health is not critical — inference, chat, and TUI should continue even if a messaging channel is misconfigured.
Replace the global channelHealthCheckMinutes=0 with per-account healthMonitor.enabled=false on each messaging channel. This prevents the health monitor from killing the gateway when a channel has placeholder tokens, while keeping the global health monitor active for inference and other subsystems. OpenClaw supports per-account overrides via accounts.default.healthMonitor.enabled in the channel config.
Any uncaught exception or unhandled rejection from any npm dependency crashes the gateway, killing inference, chat, and TUI. We've been adding per-library guards (proxy fix, Slack guard, ciao guard) but this is whack-a-mole — the next library that does something unexpected in a restricted sandbox will crash the gateway again. Add a global safety net preload (sandbox-safety-net.js) that catches ALL uncaught exceptions and unhandled rejections, logs them, and continues. Only active when OPENSHELL_SANDBOX=1 (set by OpenShell at runtime) — outside a sandbox, normal Node.js crash behavior is preserved. Loaded as the FIRST --require preload so its handlers register before any library code runs. Per-library guards (Slack, ciao) still provide targeted handling with better log messages; the safety net is the last resort for everything else.
OpenClaw installs its own unhandledRejection handler that calls process.exit(1) for non-transient errors. Our safety net catches the rejection first and swallows it, but Node.js delivers the event to ALL listeners — OpenClaw's handler also fires and exits. Monkey-patch process.exit to block exits during the rejection delivery window. A flag (_swallowing) is set during our handler and cleared on the next microtask, so OpenClaw's handler (same tick) hits the intercepted process.exit and the gateway survives.
This test has been failing since March 31 and wastes API tokens on every nightly run without providing actionable signal.
Remove all entrypoint execution traces, /tmp dumps, gateway crash diagnostics, and verbose guard-skip logging added during debugging. Simplify E2E diagnostics to just guard file and NODE_OPTIONS checks.
shfmt reformatted case statement indentation. The nemotron test regex for validate_tmp_permissions was too strict — matched only when _NEMOTRON_FIX_SCRIPT was the second argument, but it's now further in the argument list due to _SANDBOX_SAFETY_NET and _CIAO_GUARD_SCRIPT.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
scripts/nemoclaw-start.sh (2)
1634-1634: Same suggestion: add_SLACK_GUARD_SCRIPTto root-path validation.For consistency with the non-root path recommendation.
♻️ Suggested fix
-validate_tmp_permissions "$_SANDBOX_SAFETY_NET" "$_PROXY_FIX_SCRIPT" "$_NEMOTRON_FIX_SCRIPT" "$_CIAO_GUARD_SCRIPT" +validate_tmp_permissions "$_SANDBOX_SAFETY_NET" "$_PROXY_FIX_SCRIPT" "$_NEMOTRON_FIX_SCRIPT" "$_CIAO_GUARD_SCRIPT" "$_SLACK_GUARD_SCRIPT"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` at line 1634, The call to validate_tmp_permissions is missing the _SLACK_GUARD_SCRIPT argument, so update the invocation of validate_tmp_permissions to include _SLACK_GUARD_SCRIPT alongside the existing arguments (_SANDBOX_SAFETY_NET, _PROXY_FIX_SCRIPT, _NEMOTRON_FIX_SCRIPT, _CIAO_GUARD_SCRIPT) so the root-path validation covers the Slack guard script as well.
1493-1493: Consider adding_SLACK_GUARD_SCRIPTto validation.The Slack guard (
/tmp/nemoclaw-slack-channel-guard.js) is a trust-boundary file loaded viaNODE_OPTIONS, but it's not passed tovalidate_tmp_permissions. Since the validator skips non-existent files, adding it is safe even when Slack isn't configured.♻️ Suggested fix
- validate_tmp_permissions "$_SANDBOX_SAFETY_NET" "$_PROXY_FIX_SCRIPT" "$_NEMOTRON_FIX_SCRIPT" "$_CIAO_GUARD_SCRIPT" + validate_tmp_permissions "$_SANDBOX_SAFETY_NET" "$_PROXY_FIX_SCRIPT" "$_NEMOTRON_FIX_SCRIPT" "$_CIAO_GUARD_SCRIPT" "$_SLACK_GUARD_SCRIPT"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` at line 1493, Call validate_tmp_permissions with the Slack guard variable as well: add _SLACK_GUARD_SCRIPT to the existing invocation that currently passes _SANDBOX_SAFETY_NET, _PROXY_FIX_SCRIPT, _NEMOTRON_FIX_SCRIPT, and _CIAO_GUARD_SCRIPT; this ensures the trust-bound Slack guard file (referenced via NODE_OPTIONS, _SLACK_GUARD_SCRIPT) is validated too — it's safe to include because validate_tmp_permissions already skips non-existent files.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/lib/sandbox-init.sh`:
- Around line 127-132: The permission validator complains because
validate_tmp_permissions expects /tmp/gateway.log mode 644 but several startup
scripts set it to 600; update the startup chmod calls (references:
agent-runtime.ts around the chmod call, src/nemoclaw.ts at the chmod line, and
agents/hermes/start.sh occurrences) to set 644 instead of 600 so the file modes
match the validator, or alternatively modify the validate_tmp_permissions logic
to accept both "600" and "644" for gateway.log; pick one approach and change all
referenced locations consistently (agent-runtime.ts, src/nemoclaw.ts,
agents/hermes/start.sh, or validate_tmp_permissions) so the validator no longer
flags a mismatch.
---
Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Line 1634: The call to validate_tmp_permissions is missing the
_SLACK_GUARD_SCRIPT argument, so update the invocation of
validate_tmp_permissions to include _SLACK_GUARD_SCRIPT alongside the existing
arguments (_SANDBOX_SAFETY_NET, _PROXY_FIX_SCRIPT, _NEMOTRON_FIX_SCRIPT,
_CIAO_GUARD_SCRIPT) so the root-path validation covers the Slack guard script as
well.
- Line 1493: Call validate_tmp_permissions with the Slack guard variable as
well: add _SLACK_GUARD_SCRIPT to the existing invocation that currently passes
_SANDBOX_SAFETY_NET, _PROXY_FIX_SCRIPT, _NEMOTRON_FIX_SCRIPT, and
_CIAO_GUARD_SCRIPT; this ensures the trust-bound Slack guard file (referenced
via NODE_OPTIONS, _SLACK_GUARD_SCRIPT) is validated too — it's safe to include
because validate_tmp_permissions already skips non-existent files.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 19af5706-991f-463d-9648-fbac9d3dbad7
📒 Files selected for processing (5)
.github/workflows/nightly-e2e.yamlDockerfilescripts/lib/sandbox-init.shscripts/nemoclaw-start.shtest/e2e/test-messaging-providers.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/test-messaging-providers.sh
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/lib/sandbox-init.sh (1)
214-214: Keep best-effort behavior, but log lock failures.Line 214 swallows all
chmoderrors silently. That keeps startup resilient, but it also hides hardening regressions. Emit a warning when the lock fails.🛠️ Suggested change
- chmod 444 "$rc_file" 2>/dev/null || true + if ! chmod 444 "$rc_file" 2>/dev/null; then + echo "[SECURITY] Failed to lock ${rc_file} to 444; continuing (best-effort)." >&2 + fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/lib/sandbox-init.sh` at line 214, The chmod call currently swallows all errors for "$rc_file" (chmod 444 "$rc_file" 2>/dev/null || true); change it to keep the best-effort behavior but log a warning when the lock fails: run chmod 444 on "$rc_file", capture its exit status, and if it fails emit a warning to stderr (including the rc_file name and the failure status or errno) while still allowing startup to continue; update the block referencing "$rc_file" so failures are visible but non-fatal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/lib/sandbox-init.sh`:
- Around line 121-132: The trust-boundary mapping in the script is out of sync:
the runtime check in the for-loop sets /tmp/gateway.log expected_perms="644"
(via the case handling for */gateway.log) but the earlier documentation/table
still lists gateway.log as 600; update that trust-boundary map or comment to
state /tmp/gateway.log is mode 644 so the docs match the implemented check
(search for the trust-boundary table/header and the entries referencing
gateway.log and change its mode to 644).
---
Nitpick comments:
In `@scripts/lib/sandbox-init.sh`:
- Line 214: The chmod call currently swallows all errors for "$rc_file" (chmod
444 "$rc_file" 2>/dev/null || true); change it to keep the best-effort behavior
but log a warning when the lock fails: run chmod 444 on "$rc_file", capture its
exit status, and if it fails emit a warning to stderr (including the rc_file
name and the failure status or errno) while still allowing startup to continue;
update the block referencing "$rc_file" so failures are visible but non-fatal.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 17e6f851-bc96-4b7e-b7df-1570d88a71f5
📒 Files selected for processing (3)
scripts/lib/sandbox-init.shscripts/nemoclaw-start.shtest/nemotron-inference-fix.test.ts
✅ Files skipped from review due to trivial changes (1)
- test/nemotron-inference-fix.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/nemoclaw-start.sh
…Slack guard to validation Same issue as the nemotron test — regex was too strict for the new argument ordering. Also add _SLACK_GUARD_SCRIPT to validate_tmp_permissions calls per CodeRabbit review.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)
1267-1293: Handler ordering note: safety net will catch ciao errors first.The safety net is registered before the ciao guard (
--requireorder at lines 1006 and 1296). Node.js invokesuncaughtExceptionhandlers in registration order, so the safety net will swallow ciao errors before this handler runs.This is fine for correctness (both keep the gateway alive), but the ciao-specific logging at lines 1274-1277 and 1283-1286 will never execute. Consider whether the more descriptive ciao-specific messages are important enough to warrant reordering or consolidating the detection logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 1267 - 1293, The ciao-specific uncaughtException logic (checks for 'ERR_SYSTEM_ERROR' with 'uv_interface_addresses' and stack/message checks for 'ciao'/'networkInterfaces' inside the process.on('uncaughtException', ...) handler will never run because the generic safety-net handler is registered earlier; fix by either registering this ciao handler before the safety-net (ensure the process.on('uncaughtException', handler) call that contains the 'uv_interface_addresses' and 'networkInterfaces' checks is invoked prior to the safety-net registration) or merge the ciao detection/logging into the existing safety-net handler by adding the same error checks and the descriptive ciao-specific process.stderr.write messages so the ciao messages are actually emitted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 1267-1293: The ciao-specific uncaughtException logic (checks for
'ERR_SYSTEM_ERROR' with 'uv_interface_addresses' and stack/message checks for
'ciao'/'networkInterfaces' inside the process.on('uncaughtException', ...)
handler will never run because the generic safety-net handler is registered
earlier; fix by either registering this ciao handler before the safety-net
(ensure the process.on('uncaughtException', handler) call that contains the
'uv_interface_addresses' and 'networkInterfaces' checks is invoked prior to the
safety-net registration) or merge the ciao detection/logging into the existing
safety-net handler by adding the same error checks and the descriptive
ciao-specific process.stderr.write messages so the ciao messages are actually
emitted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e21f514d-2010-48f4-bd29-633065660bf6
📒 Files selected for processing (2)
scripts/lib/sandbox-init.shscripts/nemoclaw-start.sh
- Update trust-boundary doc table: gateway.log is now 644, not 600 - Log a warning when lock_rc_files chmod fails instead of silent swallow
CodeRabbit review feedback — all addressed
|
There was a problem hiding this comment.
♻️ Duplicate comments (1)
scripts/lib/sandbox-init.sh (1)
127-132:⚠️ Potential issue | 🟠 MajorHandle mixed
gateway.logmodes during rollout to prevent false security failures.On Line 128,
gateway.logis enforced as644only, but current startup paths still set600(from the provided snippets insrc/nemoclaw.tsandsrc/lib/agent-runtime.ts). That creates avoidable validation failures. Either migrate all writers to644in the same rollout, or temporarily accept both600and644here until writers are unified.Possible compatibility patch in this file
- case "$f" in - */gateway.log) expected_perms="644" ;; - *) expected_perms="600" ;; - esac - if [ "$perms" != "$expected_perms" ]; then - echo "[SECURITY] $f has unexpected permissions: mode=$perms (expected $expected_perms)" >&2 - failed=1 - fi + case "$f" in + */gateway.log) + if [ "$perms" != "644" ] && [ "$perms" != "600" ]; then + echo "[SECURITY] $f has unexpected permissions: mode=$perms (expected 600 or 644)" >&2 + failed=1 + fi + ;; + *) + expected_perms="600" + if [ "$perms" != "$expected_perms" ]; then + echo "[SECURITY] $f has unexpected permissions: mode=$perms (expected $expected_perms)" >&2 + failed=1 + fi + ;; + esac🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/lib/sandbox-init.sh` around lines 127 - 132, The current permission check in the case for "*/gateway.log" forces expected_perms="644" causing false failures while some writers still use 600; update the logic in the case that sets expected_perms (the branch that matches "*/gateway.log") to accept both "600" and "644" during rollout — i.e., when f matches gateway.log set expected_perms to a list or adjust the subsequent comparison of perms (the if that compares "$perms" != "$expected_perms") to allow either "600" or "644" for gateway.log so the script tolerates both modes until all writers are migrated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@scripts/lib/sandbox-init.sh`:
- Around line 127-132: The current permission check in the case for
"*/gateway.log" forces expected_perms="644" causing false failures while some
writers still use 600; update the logic in the case that sets expected_perms
(the branch that matches "*/gateway.log") to accept both "600" and "644" during
rollout — i.e., when f matches gateway.log set expected_perms to a list or
adjust the subsequent comparison of perms (the if that compares "$perms" !=
"$expected_perms") to allow either "600" or "644" for gateway.log so the script
tolerates both modes until all writers are migrated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ee918231-b390-41d2-b179-2e40b6a1c50a
📒 Files selected for processing (1)
scripts/lib/sandbox-init.sh
Hermes uses chmod 600 for gateway.log while OpenClaw uses 644. Both share validate_tmp_permissions from sandbox-init.sh. Accept either permission to avoid breaking Hermes's entrypoint.
## Summary Refreshes user-facing docs for the last 24 hours of merged NemoClaw history and bumps the docs metadata to 0.0.29, the next version after v0.0.28. The updates are limited to behavior supported by merged PR descriptions and diffs. ## Changes - `docs/reference/commands.md`: documented `nemoclaw <name> policy-add --from-file` and `--from-dir`, including custom preset review guidance, from #2077 / commit `7720b175`. - `docs/deployment/deploy-to-remote-gpu.md`: clarified that non-loopback `CHAT_UI_URL` disables OpenClaw device pairing for remote browser-only deployments, from #2449 / commit `f5ee8a4d`. - `docs/inference/inference-options.md`: documented provider-aware credential retry validation and the NVIDIA-only `nvapi-` prefix check, from #2389 / commit `6f7f0c6d`. - `docs/inference/switch-inference-providers.md`: documented `NEMOCLAW_INFERENCE_INPUTS` for text/image-capable model metadata baked into `openclaw.json`, from #2441 / commit `f4391892`. - `docs/reference/troubleshooting.md`: added the Git certificate verification entry for proxy CA propagation through `GIT_SSL_CAINFO`, `GIT_SSL_CAPATH`, `CURL_CA_BUNDLE`, and `REQUESTS_CA_BUNDLE`, from #2345 / commit `fa0dc1ab`. - `docs/versions1.json` and `docs/project.json`: promoted docs version `0.0.29`; `docs/versions1.json` omits unpublished `0.0.26`, `0.0.27`, and `0.0.28` entries. - `.agents/skills/nemoclaw-user-*`: regenerated derived user skill references from the updated docs. - Reviewed with no extra doc changes: #2575 / `d392ec07`, #2565 / `a3231049`, #1965 / `db1ef3ca`, #1990 / `db665834`, #2495 / `7da86fa3`, #2496 / `3192f4f4`, #2490 / `8c209058`, #2487 / `1f615e2f`, #2483 / `5653d33a`, #2482 / `31c782c0`, #2464 / `23bb5703`, #2472 / `a54f9a34`, and #2437 / `6bc860d7`. - Skipped per docs policy: #2420 / `7b76df6b` touched the experimental sandbox config path listed in `docs/.docs-skip`; #2466 / `cc15689c` touched a skipped term and CI-only sandbox image files. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification <!-- Check each item you ran and confirmed. Leave unchecked items you skipped. --> - [x] `npx prek run --all-files` passes - [ ] `npm test` passes — failed locally in installer-integration tests and one onboard helper timeout; the doc-scoped hook test projects passed under `prek`. - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) — build succeeded, but local Sphinx emitted the existing version-switcher file read message. - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure <!-- If an AI agent authored or co-authored this PR, check the box and name the tool. Remove this section for fully human-authored PRs. --> - [x] AI-assisted — tool: Codex --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Support for custom YAML presets in policy configuration via --from-file and --from-dir. * New build-time inference input option to declare accepted modalities (text or text,image). * **Improvements** * Credential validation now offers interactive recovery: re-enter key, retry, choose another provider, or exit. * Clarified provider-specific API key prefix handling (nvapi- only applies to NVIDIA keys). * **Documentation** * TLS certificate troubleshooting for inspected networks. * Clarified remote dashboard security/device-pairing behavior; command docs updated; docs version bumped. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
Summary
Fixes a 5-day outage where the gateway never started in non-root sandbox mode (Brev Launchable, no-new-privileges containers). Also adds a global safety net preventing any npm library crash from killing the gateway.
Changes
Entrypoint Landlock tolerance (
scripts/nemoclaw-start.sh,scripts/lib/sandbox-init.sh)install_configure_guard: wrap all.bashrc/.profilewrites in|| true— the[ -w file ]test passes (DAC) but Landlock blocks the actual write, crashing the entrypoint underset -elock_rc_files:|| trueon chmod callsvalidate_tmp_permissions: expect 644 for gateway.log20407589(Apr 20) addedinstall_configure_guardwhich writes to Landlock-protected files. Every non-root sandbox since then had a dead gateway.Global sandbox safety net (
scripts/nemoclaw-start.sh)sandbox-safety-net.jspreload — catches ALL uncaught exceptions and unhandled rejections in sandbox mode (OPENSHELL_SANDBOX=1), logs them, and continuesprocess.exit()during swallowed rejection delivery so OpenClaw's own handler (which callsprocess.exit(1)for non-transient errors) doesn't kill the gateway--requirepreload so handlers register before any library codeciao network guard (
scripts/nemoclaw-start.sh)@homebridge/ciaomDNS library crash (os.networkInterfaces()→uv_interface_addressesSystemError in restricted namespaces)os.networkInterfacesto return{}on failureSlack guard improvements (
scripts/nemoclaw-start.sh)proxy-env.shfor connect sessionsPer-channel health monitor disable (
Dockerfile)healthMonitor.enabled: falseon each messaging channel accountRemove cloud-experimental-e2e (
.github/workflows/nightly-e2e.yaml)Test plan
Summary by CodeRabbit
Bug Fixes
Chores
Tests