DNS proxy in cluster-entrypoint.sh fails silently on Linux with systemd-resolved #437

@jeremy-newhouse

Summary

The setup_dns_proxy() function in deploy/docker/cluster-entrypoint.sh writes nameserver <container-IP> to /etc/rancher/k3s/resolv.conf and sets up iptables PREROUTING DNAT rules to forward port 53 to Docker's internal DNS (127.0.0.11). However, the DNAT rules are ineffective for builds running inside k3s, causing all image builds (and any DNS resolution during builds) to fail with "Temporary failure resolving" errors.

Environment

  • Host OS: Ubuntu with systemd-resolved (default on Ubuntu 20.04+, including DGX Spark)
  • OpenShell CLI: openshell 0.0.10
  • Image: ghcr.io/nvidia/openshell/cluster:0.0.10

Steps to Reproduce

  1. On a Linux host with systemd-resolved, start an OpenShell gateway
  2. Create a sandbox with a Dockerfile that runs apt-get update
  3. The build fails:
    Err:1 http://deb.debian.org/debian bookworm InRelease
      Temporary failure resolving 'deb.debian.org'
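A minimal build context for step 2 might look like the sketch below. The `debian:bookworm` base image is an assumption inferred from the error output, and `make_repro_context` is a hypothetical helper name; the OpenShell command that consumes the Dockerfile is not shown because the report does not give it.

```shell
# Sketch only: writes a two-line Dockerfile that triggers DNS resolution at build time.
# make_repro_context is a hypothetical helper, not part of OpenShell.
make_repro_context() {
  dir="$1"
  mkdir -p "$dir"
  cat > "$dir/Dockerfile" <<'EOF'
FROM debian:bookworm
RUN apt-get update
EOF
}

ctx=$(mktemp -d)
make_repro_context "$ctx"
cat "$ctx/Dockerfile"
```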
    

Root Cause

The setup_dns_proxy() function:

  1. Discovers Docker DNS ports from iptables DOCKER_OUTPUT chain
  2. Gets the container's eth0 IP (e.g., 172.18.0.2)
  3. Adds iptables PREROUTING DNAT rules to forward :53 → 127.0.0.11:<high-port>
  4. Writes nameserver 172.18.0.2 to /etc/rancher/k3s/resolv.conf
  5. Verifies DNS from the container's own network namespace, where the check succeeds

However, k3s builds (containerd) run in a different network context where the PREROUTING DNAT rules don't apply, so their DNS queries to 172.18.0.2:53 get "connection refused".

The fallback to 8.8.8.8/8.8.4.4 only triggers if setup_dns_proxy returns non-zero, but it can "succeed" (write the rules, pass self-verification) even though the rules don't work for k3s pods/builds.
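To make the port discovery and DNAT steps concrete, here is a rough sketch of what they amount to. It is not the code from cluster-entrypoint.sh: the helper names `docker_dns_port` and `dnat_rule` are hypothetical, the sample rules and port numbers are made up, and the rule is only printed, never applied.

```shell
# Sketch only: helper names are hypothetical, not taken from cluster-entrypoint.sh.

# Print the port Docker's embedded DNS listens on for protocol $1 (udp or tcp),
# reading `iptables -t nat -S DOCKER_OUTPUT` style rules from stdin.
docker_dns_port() {
  awk -v proto="$1" '
    index($0, "-p " proto " ") > 0 {
      for (i = 1; i < NF; i++) {
        if ($i == "--to-destination" && index($(i + 1), "127.0.0.11:") == 1) {
          split($(i + 1), parts, ":")
          print parts[2]
          exit
        }
      }
    }'
}

# Print (do not run) the PREROUTING DNAT rule for container IP $1, protocol $2, port $3.
dnat_rule() {
  printf 'iptables -t nat -A PREROUTING -d %s -p %s --dport 53 -j DNAT --to-destination 127.0.0.11:%s\n' "$1" "$2" "$3"
}

# Sample rules of the kind Docker installs in DOCKER_OUTPUT; port numbers are invented.
sample_rules='-A DOCKER_OUTPUT -d 127.0.0.11/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:41231
-A DOCKER_OUTPUT -d 127.0.0.11/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:53421'

for proto in udp tcp; do
  port=$(printf '%s\n' "$sample_rules" | docker_dns_port "$proto")
  dnat_rule 172.18.0.2 "$proto" "$port"
done
```

Printing the rules this way makes it easy to compare them with the output of `iptables -t nat -S PREROUTING` inside the gateway container. Rules that are present there can still fail to match the traffic coming from builds, which is the situation described above.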

Workaround

docker exec <gateway-container> sh -c 'echo "nameserver 8.8.8.8" > /etc/rancher/k3s/resolv.conf'

Suggested Fix

After setting up the iptables DNAT rules, verify DNS from a network namespace that mirrors k3s pod networking (not just the container's own namespace). If verification fails, fall back to public DNS (8.8.8.8/8.8.4.4).
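One possible shape for this, as a sketch rather than a patch: the resolv.conf write is gated on a probe command supplied by the caller. `write_resolv_conf` is a hypothetical helper name, and the probe itself (something that resolves a name from a pod-like network namespace) is deliberately left as a parameter because this report does not establish how best to implement it.

```shell
# Sketch only: write_resolv_conf is a hypothetical helper, not the existing script's API.
# Usage: write_resolv_conf FILE PROXY_IP PROBE_CMD [ARGS...]
# Runs PROBE_CMD; on success points FILE at the DNS proxy, otherwise at public DNS.
write_resolv_conf() {
  file="$1"
  proxy_ip="$2"
  shift 2
  if "$@"; then
    printf 'nameserver %s\n' "$proxy_ip" > "$file"
  else
    printf 'nameserver 8.8.8.8\nnameserver 8.8.4.4\n' > "$file"
  fi
}

# Demonstration with a probe that always fails (`false`), writing to a temp file
# instead of /etc/rancher/k3s/resolv.conf.
out=$(mktemp)
write_resolv_conf "$out" 172.18.0.2 false
cat "$out"
# prints:
# nameserver 8.8.8.8
# nameserver 8.8.4.4
```

The important design point is that the probe must run where builds run. A probe executed in the container's own namespace would reproduce the current false positive.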
