Summary
The HTTP egress proxy inside the sandbox (10.200.0.1:3128, implemented in /opt/openshell/bin/openshell-sandbox) hard-terminates established connections after approximately 2 minutes. This makes any WebSocket-based integration — including the native openclaw Discord plugin, WhatsApp Web (see #361), and any other long-lived TCP connection — non-functional inside the sandbox. There is no configuration option to change this behaviour.
Environment
- Platform: macOS (Apple Silicon, M-series)
- openshell-sandbox: running as PID 1 in k3s pod (ghcr.io/nvidia/openshell/gateway:dev)
- Cluster: Docker container openshell-cluster-nemoclaw (k3s via Docker)
- NemoClaw version: 2026.3.x
- OpenClaw version: 2026.3.11–2026.3.13
Two Distinct Failure Modes
1. EAI_AGAIN on gateway.discord.gg — permanent DNS failure
The ws library (used by discord.js for the WebSocket leg) does not honour HTTPS_PROXY. It attempts a direct socket connection and requires local DNS resolution of gateway.discord.gg. Inside the sandbox, this hostname never resolves — the cluster DNS (10.43.0.10, CoreDNS) does not forward public hostname resolution for sandbox pods.
The REST client works because it sends CONNECT discord.com:443 to the HTTP proxy, which resolves the hostname server-side. The WebSocket client bypasses the proxy and hits DNS directly, so it always fails.
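To make the difference between the two paths concrete, here is a minimal Python sketch (not the discord.js or ws code). It assumes only the proxy address given above; the function names are illustrative. `resolves_locally` is what a proxy-unaware client effectively depends on, while `open_tunnel` dials only the proxy and leaves the hostname lookup to it.

```python
import socket


def resolves_locally(host: str, port: int = 443) -> bool:
    """Return True if the local resolver can resolve `host`.

    A client that ignores HTTPS_PROXY depends on this lookup succeeding.
    """
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # EAI_AGAIN and other resolver failures surface here.
        return False


def build_connect_request(host: str, port: int) -> bytes:
    """Build the HTTP CONNECT request a proxy-aware client sends first."""
    target = f"{host}:{port}"
    return (
        f"CONNECT {target} HTTP/1.1\r\n"
        f"Host: {target}\r\n"
        "\r\n"
    ).encode("ascii")


def open_tunnel(proxy_host: str, proxy_port: int, host: str, port: int,
                timeout: float = 10.0) -> socket.socket:
    """Open a TCP tunnel to host:port through an HTTP proxy.

    Only the proxy address is dialled locally. The target hostname travels
    inside the CONNECT line, so the proxy performs the DNS lookup.
    """
    sock = socket.create_connection((proxy_host, proxy_port), timeout=timeout)
    try:
        sock.sendall(build_connect_request(host, port))
        # Read until the end of the proxy's response headers.
        response = b""
        while b"\r\n\r\n" not in response:
            chunk = sock.recv(4096)
            if not chunk:
                break
            response += chunk
        status_line = response.split(b"\r\n", 1)[0].decode("latin-1")
        parts = status_line.split(" ", 2)
        if len(parts) < 2 or parts[1] != "200":
            raise ConnectionError(f"proxy refused CONNECT: {status_line!r}")
        return sock
    except Exception:
        sock.close()
        raise
```

Inside the sandbox, `open_tunnel("10.200.0.1", 3128, "gateway.discord.gg", 443)` would exercise the proxy path, provided the sandbox policy allows that endpoint, whereas `resolves_locally("gateway.discord.gg")` is expected to return False given the DNS behaviour described above.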
Symptom: tight infinite loop in the openshell logs —
L7_REQUEST GET /api/v10/oauth2/applications/@me → discord.com:443 allow
L7_REQUEST GET /api/v10/users/@me → discord.com:443 allow
// then EAI_AGAIN on gateway.discord.gg, retry immediately, repeat
2. 2-minute hard connection cutoff — WebSocket dropped even if DNS resolves
If DNS is resolved externally (e.g. via a workaround), the WebSocket connects successfully but is terminated by the proxy after ~2 minutes 13 seconds with WebSocket close code 1006 (abnormal closure). The discord.js client attempts to reconnect; reconnect attempts also fail (same DNS/proxy issue), and the health-monitor eventually restarts the provider from scratch — resulting in ~50% uptime with 2–3 minute outage windows.
Confirmed via gateway log:
[discord] logged in to discord as <id> (Kaveri) ← 03:58:47
[discord] gateway error: AggregateError ← 04:01:00 (+2m13s)
[discord] gateway: WebSocket connection closed (1006)
[discord] gateway: Reconnecting with backoff: 1000ms
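As a quick check of the numbers, the following sketch recomputes the connection lifetime from the two timestamps in the log above and derives the uptime range implied by a 2 to 3 minute outage after each drop. The helper names are illustrative, and the outage durations are the estimates quoted in this report, not additional measurements.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Time between two HH:MM:SS log timestamps, allowing a midnight wrap."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta


def uptime_fraction(up: timedelta, down: timedelta) -> float:
    """Fraction of each connect/drop cycle spent connected."""
    return up / (up + down)


# Timestamps taken from the gateway log: login, then the gateway error.
connected = elapsed("03:58:47", "04:01:00")
print(connected)  # 0:02:13

# Outage window of 3 minutes (worst case) and 2 minutes (best case).
low = uptime_fraction(connected, timedelta(minutes=3))
high = uptime_fraction(connected, timedelta(minutes=2))
print(f"uptime between {low:.0%} and {high:.0%}")  # uptime between 42% and 53%
```

The resulting range is consistent with the roughly 50% uptime observed.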
Root Cause
The cutoff lives inside /opt/openshell/bin/openshell-sandbox — the compiled Rust binary running as PID 1 in the sandbox k3s pod. Inspected via:
docker exec openshell-cluster-nemoclaw kubectl exec -n openshell kaveri \
-- /opt/openshell/bin/openshell-sandbox --help
The --help output exposes no proxy timeout flag, no --connect-timeout, no env var, no config file. The k3s pod spec for the sandbox container has no relevant environment variables. The timeout is hardcoded in the binary.
Impact
Any openclaw plugin or user workload that requires a persistent TCP/WebSocket connection is broken inside the sandbox:
- openclaw native Discord plugin
- WhatsApp Web (confirmed in #361, "WhatsApp Web QR code not generated in NemoClaw sandbox (WebSocket exception)": same 408 timeout on WS opening handshake)
- Any long-running HTTPS stream or SSE connection
Current Workaround
Run the Discord client on the host (outside the sandbox), forwarding messages to the agent via SSH + nemoclaw-start openclaw agent. This mirrors the existing Telegram bridge architecture. It works, but duplicates logic the openclaw gateway already handles natively and requires maintaining a parallel gateway layer.
Proposed Fix
Either of the following in openshell-sandbox:
1. Remove or increase the hard cutoff for CONNECT tunnel connections (or make it configurable via env var or sandbox policy).
2. Add a proxy: direct mode to the sandbox network policy schema, allowing specific endpoints (e.g. gateway.discord.gg) to bypass the HTTP proxy entirely, either directly through the network namespace if routing allows or via a separate tunnel with no timeout limit.
Option 2 is preferable long-term as it gives per-endpoint control without globally relaxing the proxy timeout.
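To illustrate what per-endpoint control could look like, here is a hypothetical Python sketch of the routing decision. Nothing in it reflects the actual openshell-sandbox policy schema: `EndpointRule`, its fields, and the `"http"` / `"direct"` values are all assumptions made for this example.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Iterable


@dataclass(frozen=True)
class EndpointRule:
    """Hypothetical policy entry; not the real sandbox schema."""
    host_pattern: str
    port: int
    proxy: str = "http"  # "http" (default, via the egress proxy) or "direct"


def route_for(host: str, port: int, rules: Iterable[EndpointRule]) -> str:
    """Return the routing mode for host:port; the first matching rule wins.

    Endpoints without a matching rule keep going through the HTTP proxy,
    so the proxy behaviour is not relaxed globally.
    """
    for rule in rules:
        if rule.port == port and fnmatch(host, rule.host_pattern):
            return rule.proxy
    return "http"


# Assumed example policy: only the Discord gateway bypasses the proxy.
rules = [
    EndpointRule("gateway.discord.gg", 443, proxy="direct"),
    EndpointRule("*.discord.com", 443),
]

print(route_for("gateway.discord.gg", 443, rules))  # direct
print(route_for("discord.com", 443, rules))         # http
```

Defaulting unmatched endpoints to the proxy is the design choice behind preferring this option: only long-lived connections that are explicitly listed escape the timeout.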
Related
- #213 "Proposal: Add Discord bridge (host-side, similar to Telegram bridge)": Discord bridge proposal (workaround)
- #361 "WhatsApp Web QR code not generated in NemoClaw sandbox (WebSocket exception)": WhatsApp Web WebSocket failure (same root cause)
- #311 "OpenClaw gateway inside sandbox requires manual start after every reboot": gateway manual restart after reboot (related: sandbox process lifecycle)