
[Bug] openshell-sandbox egress proxy kills WebSocket connections after ~2 minutes — Discord (and other WS-based plugins) cannot run inside sandbox #409

@subrih

Summary

The HTTP egress proxy inside the sandbox (10.200.0.1:3128, implemented in /opt/openshell/bin/openshell-sandbox) hard-terminates established connections after approximately 2 minutes. This makes any WebSocket-based integration non-functional inside the sandbox, including the native openclaw Discord plugin and WhatsApp Web (see #361), and it breaks any other long-lived TCP connection as well. There is no configuration option to change this behaviour.


Environment

  • Platform: macOS (Apple Silicon, M-series)
  • openshell-sandbox: running as PID 1 in k3s pod (ghcr.io/nvidia/openshell/gateway:dev)
  • Cluster: Docker container openshell-cluster-nemoclaw (k3s via Docker)
  • NemoClaw version: 2026.3.x
  • OpenClaw version: 2026.3.11–2026.3.13

Two Distinct Failure Modes

1. EAI_AGAIN on gateway.discord.gg — permanent DNS failure

The ws library (used by discord.js for the WebSocket leg) does not honour HTTPS_PROXY. It attempts a direct socket connection and requires local DNS resolution of gateway.discord.gg. Inside the sandbox, this hostname never resolves, because the cluster DNS (10.43.0.10, CoreDNS) does not forward queries for public hostnames from sandbox pods.

The REST client works because it sends CONNECT discord.com:443 to the HTTP proxy, which resolves the hostname server-side. The WebSocket client bypasses the proxy and hits DNS directly, so it always fails.
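
To make the difference between the two paths concrete, here is a minimal Python sketch of what a proxy-aware client puts on the wire, assuming standard HTTP CONNECT semantics. The function name build_connect_request is illustrative only and is not part of discord.js, ws, or openshell. The point is that the hostname travels to the proxy as plain text, so the client never performs a DNS lookup itself.

```python
def build_connect_request(host: str, port: int) -> bytes:
    """Return the bytes a client sends to an HTTP proxy to open a tunnel.

    The target hostname is passed through verbatim; resolving it is the
    proxy's job, which is why no local DNS is needed on this path.
    """
    authority = f"{host}:{port}"
    lines = [
        f"CONNECT {authority} HTTP/1.1",
        f"Host: {authority}",
        "",  # the two empty entries produce the blank line
        "",  # that terminates the header block
    ]
    return "\r\n".join(lines).encode("ascii")


# Illustration: the request a proxy-aware WebSocket client would send
# to the sandbox proxy before starting TLS to the Discord gateway.
request = build_connect_request("gateway.discord.gg", 443)
print(request.decode("ascii"))
```

A WebSocket client that opened its socket through such a request to 10.200.0.1:3128 would likely sidestep the EAI_AGAIN failure, but it would still be subject to the connection cutoff described in failure mode 2.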

Symptom: tight infinite loop in the openshell logs —

L7_REQUEST  GET /api/v10/oauth2/applications/@me  → discord.com:443  allow
L7_REQUEST  GET /api/v10/users/@me                → discord.com:443  allow
// then EAI_AGAIN on gateway.discord.gg, retry immediately, repeat

2. 2-minute hard connection cutoff — WebSocket dropped even if DNS resolves

If DNS is resolved externally (e.g. via a workaround), the WebSocket connects successfully, but the proxy terminates it after about 2 minutes 13 seconds and the client sees WebSocket close code 1006 (abnormal closure). The discord.js client then tries to reconnect; those attempts fail too (the same DNS and proxy issues apply), and the health-monitor eventually restarts the provider from scratch. The result is roughly 50% uptime with 2–3 minute outage windows.

Confirmed via gateway log:

[discord] logged in to discord as <id> (Kaveri)         ← 03:58:47
[discord] gateway error: AggregateError                  ← 04:01:00  (+2m13s)
[discord] gateway: WebSocket connection closed (1006)
[discord] gateway: Reconnecting with backoff: 1000ms
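
The arithmetic behind the "+2m13s" annotation and the rough uptime figure can be checked with a short Python sketch. It assumes both timestamps fall on the same day and takes the 2 and 3 minute outage windows from the description above; the helper name uptime_fraction is made up for this example.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps copied from the gateway log above (same day assumed).
logged_in = datetime.strptime("03:58:47", FMT)
gateway_error = datetime.strptime("04:01:00", FMT)

# How long the WebSocket stayed up before the proxy dropped it.
up_seconds = int((gateway_error - logged_in).total_seconds())
minutes, seconds = divmod(up_seconds, 60)
print(f"connection lived {up_seconds}s ({minutes}m{seconds:02d}s)")


def uptime_fraction(up: float, down: float) -> float:
    """Fraction of time connected when `up` seconds alternate with `down` seconds."""
    return up / (up + down)


# Outage windows of 2 and 3 minutes, as reported in the description.
for outage in (120, 180):
    print(f"outage {outage}s -> uptime {uptime_fraction(up_seconds, outage):.0%}")
```

With a 133-second connection lifetime, outage windows of 2 to 3 minutes put the uptime between roughly 42% and 53%, which is consistent with the reported figure of about 50%.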

Root Cause

The cutoff lives inside /opt/openshell/bin/openshell-sandbox — the compiled Rust binary running as PID 1 in the sandbox k3s pod. Inspected via:

docker exec openshell-cluster-nemoclaw kubectl exec -n openshell kaveri \
  -- /opt/openshell/bin/openshell-sandbox --help

The --help output exposes no proxy timeout flag (for example, no --connect-timeout) and mentions no env var or config file that would control it. The k3s pod spec for the sandbox container has no relevant environment variables either. The timeout therefore appears to be hardcoded in the binary.


Impact

Any openclaw plugin or user workload that requires a persistent TCP/WebSocket connection is broken inside the sandbox. This includes the native Discord plugin and WhatsApp Web (see #361), both named in the Summary.


Current Workaround

Run the Discord client on the host (outside the sandbox), forwarding messages to the agent via SSH + nemoclaw-start openclaw agent. This mirrors the existing Telegram bridge architecture. It works, but duplicates logic the openclaw gateway already handles natively and requires maintaining a parallel gateway layer.


Proposed Fix

Either of the following in openshell-sandbox:

  1. Remove or increase the hard cutoff for CONNECT tunnel connections (or make it configurable via env var or sandbox policy).
  2. Add a proxy: direct mode to the sandbox network policy schema, allowing specific endpoints (e.g. gateway.discord.gg) to bypass the HTTP proxy entirely — directly through the network namespace if routing allows, or via a separate non-timeout-limited tunnel.

Option 2 is preferable long-term as it gives per-endpoint control without globally relaxing the proxy timeout.
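
As a rough illustration of how the two options could combine, the following Python sketch models a per-endpoint rule with a routing mode and an optional tunnel timeout. Everything here is hypothetical: the field names mode and tunnel_timeout_s, the EndpointRule type, and the resolve_rule helper do not exist in openshell-sandbox or its policy schema, and the real binary is written in Rust. It is only meant to show the intended behaviour, not a proposed implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# HYPOTHETICAL schema: none of these names exist in openshell-sandbox.
OBSERVED_CUTOFF_S = 133  # the ~2m13s cutoff seen in the gateway log


@dataclass(frozen=True)
class EndpointRule:
    host: str
    port: int
    mode: str = "proxy"  # "proxy" (current behaviour) or "direct" (option 2)
    # None means no hard cutoff for this endpoint (option 1).
    tunnel_timeout_s: Optional[int] = OBSERVED_CUTOFF_S


def resolve_rule(rules: Sequence[EndpointRule], host: str, port: int) -> EndpointRule:
    """Return the first rule matching host:port, else today's default behaviour."""
    for rule in rules:
        if rule.host == host and rule.port == port:
            return rule
    return EndpointRule(host, port)


# Example policy: only the Discord gateway bypasses the proxy and its cutoff.
policy = [
    EndpointRule("gateway.discord.gg", 443, mode="direct", tunnel_timeout_s=None),
]

ws_rule = resolve_rule(policy, "gateway.discord.gg", 443)
rest_rule = resolve_rule(policy, "discord.com", 443)
print(ws_rule.mode, ws_rule.tunnel_timeout_s)
print(rest_rule.mode, rest_rule.tunnel_timeout_s)
```

The design point is that unlisted endpoints keep the existing proxied, time-limited behaviour, so relaxing the limit for one WebSocket endpoint does not loosen it globally.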


Related

Metadata

Labels: Integration: Discord, OpenShell, Platform: MacOS, bug, priority: medium
