Skip to content

Supervisor's musl DNS resolver fails for external domains in Kubernetes (ndots:5) #2052

Description

@bsquizz

Agent Diagnostic

  • Investigated inference 503 errors in a Kind cluster (Podman-backed) running OpenShell gateway v0.0.72 with Vertex AI provider
  • Traced the failure through openshell-supervisor-network/src/proxy.rsopenshell-routerreqwesthyper_util::client::legacy::connect::http::Http::connect → musl getaddrinfo
  • Confirmed via supervisor logs that DNS resolution returns zero usable addresses for aiplatform.googleapis.com:
    TRACE hyper_util::client::legacy::connect::http: Http::connect; scheme=Some("https"), host=Some("aiplatform.googleapis.com"), port=None
    TRACE hyper_util::client::legacy::pool: checkout dropped for ("https", aiplatform.googleapis.com)
    
    No connecting to X.X.X.X:443 log appears between Http::connect and checkout dropped, confirming DNS returns no addresses.
  • Verified glibc's getaddrinfo (via Python in the same pod) resolves the same domain successfully (33 IPv4 results)
  • Confirmed the supervisor binary is statically linked with musl (ldd reports "not a dynamic executable")
  • Verified cluster-internal DNS works from musl (gateway hostname resolves fine) — only external FQDNs fail
  • Root cause: musl's getaddrinfo sends A and AAAA queries simultaneously on the same UDP socket. With Kubernetes' default ndots:5 and 5 search domains, aiplatform.googleapis.com (3 dots < 5) gets expanded through all search domains first — 12+ concurrent queries. The responses arrive out of order, causing musl to mishandle them and return zero usable addresses.
  • Workaround confirmed: patching /etc/resolv.conf to ndots:1 before the first inference request resolves the issue. New sandboxes with ndots:1 succeed; sandboxes where the supervisor has already cached a DNS failure continue to fail even after patching.

Description

The supervisor's router component fails to resolve external DNS names (e.g., aiplatform.googleapis.com) when running in Kubernetes, causing all inference requests to fail with 503 "inference service unavailable".

Root cause: The supervisor is statically linked with musl libc (required for scratch base image portability). musl's getaddrinfo implementation sends A and AAAA DNS queries simultaneously on a single UDP socket. In Kubernetes, /etc/resolv.conf defaults to ndots:5 with 5 search domains. For external FQDNs with fewer than 5 dots, musl expands through all search domains first, generating 12+ concurrent queries. The interleaved responses cause musl's resolver to return zero usable addresses.

Expected: Inference requests to external providers (Vertex AI, etc.) succeed.

Actual: Every inference request fails with {"error":"inference service unavailable"}. The supervisor logs show Http::connect followed immediately by checkout dropped with no TCP connection attempt.

Reproduction Steps

  1. Deploy OpenShell gateway in a Kubernetes cluster (tested with Kind on Podman)
  2. Configure a Vertex AI provider with valid credentials
  3. Create a sandbox and send an inference request:
    openshell sandbox create --name test
    kubectl exec test -- curl -sk --proxy http://10.200.0.1:3128 \
      -X POST https://inference.local/v1/messages \
      -H "Content-Type: application/json" \
      -H "x-api-key: unused" \
      -H "anthropic-version: 2023-06-01" \
      -d '{"model":"claude-sonnet-4-6","max_tokens":5,"messages":[{"role":"user","content":"hi"}]}'
    
  4. Observe: {"error":"inference service unavailable"}
  5. Apply workaround and verify it fixes the issue:
    kubectl exec test -- sed -i 's/ndots:5/ndots:1/' /etc/resolv.conf
    # Re-run the curl command above — succeeds (must be a fresh sandbox, not one that already failed)
    

Environment

  • OS: Fedora 44 (Linux 7.0.12)
  • Container runtime: Podman (Kind with KIND_EXPERIMENTAL_PROVIDER=podman)
  • OpenShell: v0.0.72
  • Kubernetes: Kind v0.27.0 (default ndots:5, 5 search domains)
  • Provider: Vertex AI (google-vertex-ai), region: global

Logs

Supervisor logs showing the failure sequence (RUST_LOG=trace):

05:40:28.443Z OCSF NET:OPEN [INFO] ALLOWED inference.local:443
05:40:28.443Z INFO openshell_router: routing proxy inference request (streaming)
05:40:28.443Z TRACE hyper_util::client::legacy::pool: checkout waiting for idle connection: ("https", aiplatform.googleapis.com)
05:40:28.443Z DEBUG log: starting new connection: https://aiplatform.googleapis.com/
05:40:28.443Z TRACE hyper_util::client::legacy::connect::http: Http::connect; scheme=Some("https"), host=Some("aiplatform.googleapis.com"), port=None
05:40:28.444Z TRACE hyper_util::client::legacy::pool: checkout dropped for ("https", aiplatform.googleapis.com)
05:40:28.444Z TRACE log: shouldn't retry!
05:40:28.444Z OCSF NET:FAIL [LOW] inference.local:443

Note: no connecting to X.X.X.X:443 log between Http::connect and checkout dropped — DNS resolution returned zero addresses.

Suggested Fix

Switch from musl's getaddrinfo to a pure-Rust DNS resolver by enabling the hickory-dns feature on reqwest. In workspace Cargo.toml:

-reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls-native-roots"] }
+reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls-native-roots", "hickory-dns"] }

This bypasses musl's getaddrinfo entirely, eliminating the A+AAAA concurrent query issue regardless of ndots configuration. It also avoids pulling in the system libc for DNS, which aligns with the supervisor's static-binary design goal.

Alternatively, setting dnsConfig.options: [{name: ndots, value: "1"}] on sandbox pod specs would fix Kubernetes deployments specifically.

Metadata

Metadata

Assignees

No one assigned

    Labels

    state:triage-neededOpened without agent diagnostics and needs triage

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions