Supervisor's musl DNS resolver fails for external domains in Kubernetes (ndots:5)

## Agent Diagnostic

- Investigated inference 503 errors in a Kind cluster (Podman-backed) running OpenShell gateway v0.0.72 with Vertex AI provider
- Traced the failure through `openshell-supervisor-network/src/proxy.rs` → `openshell-router` → `reqwest` → `hyper_util::client::legacy::connect::http::Http::connect` → musl `getaddrinfo`
- Confirmed via supervisor logs that DNS resolution returns zero usable addresses for `aiplatform.googleapis.com`:
  ```
  TRACE hyper_util::client::legacy::connect::http: Http::connect; scheme=Some("https"), host=Some("aiplatform.googleapis.com"), port=None
  TRACE hyper_util::client::legacy::pool: checkout dropped for ("https", aiplatform.googleapis.com)
  ```
  No `connecting to X.X.X.X:443` log appears between `Http::connect` and `checkout dropped`, confirming DNS returns no addresses.
- Verified glibc's `getaddrinfo` (via Python in the same pod) resolves the same domain successfully (33 IPv4 results)
- Confirmed the supervisor binary is statically linked with musl (`ldd` reports "not a dynamic executable")
- Verified cluster-internal DNS works from musl (gateway hostname resolves fine) — only external FQDNs fail
- Root cause: musl's `getaddrinfo` sends A and AAAA queries simultaneously on the same UDP socket. With Kubernetes' default `ndots:5` and 5 search domains, `aiplatform.googleapis.com` (3 dots < 5) gets expanded through all search domains first — 12+ concurrent queries. The responses arrive out of order, causing musl to mishandle them and return zero usable addresses.
- Workaround confirmed: patching `/etc/resolv.conf` to `ndots:1` before the first inference request resolves the issue. New sandboxes with `ndots:1` succeed; sandboxes where the supervisor has already cached a DNS failure continue to fail even after patching.

## Description

The supervisor's router component fails to resolve external DNS names (e.g., `aiplatform.googleapis.com`) when running in Kubernetes, causing all inference requests to fail with 503 "inference service unavailable".

**Root cause:** The supervisor is statically linked with musl libc (required for `scratch` base image portability). musl's `getaddrinfo` implementation sends A and AAAA DNS queries simultaneously on a single UDP socket. In Kubernetes, `/etc/resolv.conf` defaults to `ndots:5` with 5 search domains. For external FQDNs with fewer than 5 dots, musl expands through all search domains first, generating 12+ concurrent queries. The interleaved responses cause musl's resolver to return zero usable addresses.

**Expected:** Inference requests to external providers (Vertex AI, etc.) succeed.

**Actual:** Every inference request fails with `{"error":"inference service unavailable"}`. The supervisor logs show `Http::connect` followed immediately by `checkout dropped` with no TCP connection attempt.

## Reproduction Steps

1. Deploy OpenShell gateway in a Kubernetes cluster (tested with Kind on Podman)
2. Configure a Vertex AI provider with valid credentials
3. Create a sandbox and send an inference request:
   ```
   openshell sandbox create --name test
   kubectl exec test -- curl -sk --proxy http://10.200.0.1:3128 \
     -X POST https://inference.local/v1/messages \
     -H "Content-Type: application/json" \
     -H "x-api-key: unused" \
     -H "anthropic-version: 2023-06-01" \
     -d '{"model":"claude-sonnet-4-6","max_tokens":5,"messages":[{"role":"user","content":"hi"}]}'
   ```
4. Observe: `{"error":"inference service unavailable"}`
5. Apply workaround and verify it fixes the issue:
   ```
   kubectl exec test -- sed -i 's/ndots:5/ndots:1/' /etc/resolv.conf
   # Re-run the curl command above — succeeds (must be a fresh sandbox, not one that already failed)
   ```

## Environment

- OS: Fedora 44 (Linux 7.0.12)
- Container runtime: Podman (Kind with `KIND_EXPERIMENTAL_PROVIDER=podman`)
- OpenShell: v0.0.72
- Kubernetes: Kind v0.27.0 (default `ndots:5`, 5 search domains)
- Provider: Vertex AI (google-vertex-ai), region: global

## Logs

Supervisor logs showing the failure sequence (RUST_LOG=trace):

```
05:40:28.443Z OCSF NET:OPEN [INFO] ALLOWED inference.local:443
05:40:28.443Z INFO openshell_router: routing proxy inference request (streaming)
05:40:28.443Z TRACE hyper_util::client::legacy::pool: checkout waiting for idle connection: ("https", aiplatform.googleapis.com)
05:40:28.443Z DEBUG log: starting new connection: https://aiplatform.googleapis.com/
05:40:28.443Z TRACE hyper_util::client::legacy::connect::http: Http::connect; scheme=Some("https"), host=Some("aiplatform.googleapis.com"), port=None
05:40:28.444Z TRACE hyper_util::client::legacy::pool: checkout dropped for ("https", aiplatform.googleapis.com)
05:40:28.444Z TRACE log: shouldn't retry!
05:40:28.444Z OCSF NET:FAIL [LOW] inference.local:443
```

Note: no `connecting to X.X.X.X:443` log between `Http::connect` and `checkout dropped` — DNS resolution returned zero addresses.

## Suggested Fix

Switch from musl's `getaddrinfo` to a pure-Rust DNS resolver by enabling the `hickory-dns` feature on reqwest. In workspace `Cargo.toml`:

```diff
-reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls-native-roots"] }
+reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls-native-roots", "hickory-dns"] }
```

This bypasses musl's `getaddrinfo` entirely, eliminating the A+AAAA concurrent query issue regardless of `ndots` configuration. It also avoids pulling in the system libc for DNS, which aligns with the supervisor's static-binary design goal.

Alternatively, setting `dnsConfig.options: [{name: ndots, value: "1"}]` on sandbox pod specs would fix Kubernetes deployments specifically.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Supervisor's musl DNS resolver fails for external domains in Kubernetes (ndots:5) #2052

Agent Diagnostic

Description

Reproduction Steps

Environment

Logs

Suggested Fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Supervisor's musl DNS resolver fails for external domains in Kubernetes (ndots:5) #2052

Description

Agent Diagnostic

Description

Reproduction Steps

Environment

Logs

Suggested Fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions