Network ‐ Verify DNS reachability

kchennag edited this page May 18, 2026 · 2 revisions

Description

This rule verifies that DNS servers are functioning correctly and reachable from all cluster nodes. The rule first checks for upstream DNS resolvers configured in the OpenShift DNS operator. If no upstream resolvers are found, it falls back to checking /etc/resolv.conf on each node. The rule then tests DNS resolution by querying each DNS server from every node and reports any DNS servers that fail to resolve queries.

Severity: High - DNS resolution is critical for cluster operations

What is checked:

  • OpenShift DNS operator upstream resolver configuration (dns.operator.openshift.io/cluster)
  • Nameserver entries in /etc/resolv.conf on each node (if no upstream resolvers configured)
  • DNS resolution functionality using dig command from all nodes
  • Rule passes if DNS servers can resolve queries from at least one node
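
The server selection described above (upstream resolvers first, /etc/resolv.conf as the fallback) can be sketched in shell. This is an illustration under stated assumptions, not the rule's actual implementation; the function names `nameservers_from_file` and `select_dns_servers` are hypothetical.

```shell
# Print the nameserver addresses listed in a resolv.conf-style file.
# Commented-out entries ("# nameserver ...") are skipped because their
# first field is "#", not "nameserver".
nameservers_from_file() {
    awk '$1 == "nameserver" { print $2 }' "$1"
}

# Print the DNS servers to test, one per line.
#   $1: space-separated upstream resolver addresses (may be empty)
#   $2: fallback file, defaults to /etc/resolv.conf
select_dns_servers() {
    upstreams="$1"
    resolv_conf="${2:-/etc/resolv.conf}"
    if [ -n "$upstreams" ]; then
        # Intentionally unquoted so the list splits into one address per line.
        printf '%s\n' $upstreams
    else
        nameservers_from_file "$resolv_conf"
    fi
}
```

For example, `select_dns_servers "" /etc/resolv.conf` prints the node's configured nameservers, while `select_dns_servers "192.168.1.1 8.8.8.8"` ignores the file and prints the two upstream addresses.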

Prerequisites

  • OpenShift cluster with network connectivity
  • dig command available on nodes
  • DNS traffic (UDP/TCP port 53) allowed between nodes and DNS servers
  • Read access to DNS operator configuration (for upstream resolver check)

Impact

DNS servers that cannot resolve queries can cause severe cluster disruption:

  • Pod startup failures - Pods cannot resolve service names or external hostnames during initialization
  • Service discovery failures - Kubernetes service DNS lookups fail, breaking inter-pod communication
  • Image pull failures - Cannot resolve container registry hostnames (e.g., quay.io, registry.redhat.io)
  • Operator failures - OpenShift operators cannot reach external APIs or update channels
  • Application errors - Applications cannot resolve external dependencies (databases, APIs, webhooks)
  • Cluster upgrades blocked - Cannot reach Red Hat update servers
  • Certificate renewal failures - ACME challenges and external CA validation fail
  • Node NotReady state - kubelet and container runtime may report issues if DNS is unavailable

Critical Note: Even if some DNS servers are functional, having non-functional servers can cause intermittent failures and increased latency due to DNS timeout retries.

Root Cause

DNS servers may fail to resolve queries due to:

  • DNS server issues

    • DNS server process stopped or crashed
    • DNS server overloaded or unresponsive
    • DNS server host powered off or rebooted
    • DNS server configuration errors preventing resolution
  • Network connectivity issues

    • Firewall rules blocking DNS traffic (UDP/TCP port 53)
    • Network interface down on DNS server
    • Routing problems between cluster nodes and DNS servers
    • Network partition or split-brain scenario
  • Configuration issues

    • Incorrect DNS server IP addresses in configuration
    • Stale DNS configuration after infrastructure changes
    • Mismatch between upstream resolvers and actual DNS infrastructure
  • Infrastructure changes

    • DNS servers migrated to new IPs without updating cluster config
    • DNS infrastructure decommissioned
    • Network topology changes (VLAN, subnet, gateway changes)

Diagnostics

Check DNS Configuration

Check upstream DNS resolvers in DNS operator:

# Check if upstream DNS resolvers are configured
oc get dns.operator.openshift.io/cluster -o jsonpath='{.spec.upstreamResolvers}'

# View full DNS operator configuration
oc get dns.operator.openshift.io/cluster -o yaml
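
To script against the upstream resolver list, it can help to extract only the addresses. The sketch below assumes the `upstreams[*].address` layout shown later on this page; the function name `upstream_resolver_addresses` is hypothetical. Entries of type SystemResolvConf carry no address and therefore produce no output.

```shell
# Print the configured upstream resolver addresses, one per line.
# Prints nothing when no upstream resolvers are configured.
upstream_resolver_addresses() {
    oc get dns.operator.openshift.io/cluster \
        -o jsonpath='{.spec.upstreamResolvers.upstreams[*].address}' |
        tr ' ' '\n' |   # jsonpath separates multiple results with spaces
        sed '/^$/d'     # drop empty lines
}
```

An empty result means the fallback to /etc/resolv.conf applies.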

Check /etc/resolv.conf on nodes:

# View DNS configuration on a specific node
oc debug node/<node-name>
chroot /host
cat /etc/resolv.conf

Test DNS Resolution

Test DNS resolution using dig (what the rule uses):

# On each node, test DNS resolution with dig
oc debug node/<node-name>
chroot /host

# Test DNS resolution (this is what the rule actually checks)
dig +short +time=2 +tries=1 @<dns-server-ip> google.com

# Expected: one or more IP addresses, for example 142.251.41.174
# Empty output, or a line starting with ";;" (such as a timeout message),
# means the DNS server cannot resolve queries
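
When several DNS servers need testing from the same node, the dig check can be wrapped in a small loop. This is a minimal sketch, not the rule's own code; the function names `dns_server_resolves` and `report_failed_servers` are hypothetical. It filters out lines starting with ";" because dig can print diagnostics such as timeouts on standard output even with +short.

```shell
# Succeed if the given DNS server returns at least one answer for the name.
#   $1: DNS server IP   $2: name to resolve (defaults to google.com)
dns_server_resolves() {
    server="$1"
    name="${2:-google.com}"
    # Keep only answer lines; dig diagnostics start with ";".
    answer=$(dig +short +time=2 +tries=1 "@$server" "$name" 2>/dev/null |
        grep -v '^;' || true)
    [ -n "$answer" ]
}

# Print every server (one per line) that fails to resolve the name.
#   $1: name to resolve   remaining arguments: DNS server IPs
report_failed_servers() {
    name="$1"
    shift
    for s in "$@"; do
        dns_server_resolves "$s" "$name" || echo "$s"
    done
}
```

Run inside the node debug shell, `report_failed_servers google.com 192.168.1.1 8.8.8.8` prints only the servers that did not answer; no output means every listed server resolved the name.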

Test network connectivity to DNS servers:

# Ping IPv4 DNS server (tests network connectivity only)
ping -c 3 -W 2 <dns-server-ip>

# Ping IPv6 DNS server
ping -6 -c 3 -W 2 <dns-server-ipv6>

Test DNS queries with additional tools:

# Test DNS query using dig with full output (status, flags, query time)
dig @<dns-server-ip> google.com

# Test DNS query using nslookup
nslookup google.com <dns-server-ip>

# Test DNS query using host
host google.com <dns-server-ip>

Check network connectivity:

# Check routing to DNS server
oc debug node/<node-name>
chroot /host
ip route get <dns-server-ip>

# Check if DNS port 53 is reachable (TCP)
nc -zv <dns-server-ip> 53

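# Check if DNS port 53 is reachable (UDP)
# UDP is connectionless, so a "succeeded" result here is not conclusive;
# confirm with an actual dig query
nc -zvu <dns-server-ip> 53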

Check CoreDNS Status

Check OpenShift DNS pods:

# Check DNS pods status
oc get pods -n openshift-dns

# Check DNS pod logs
oc logs -n openshift-dns <dns-pod-name>

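# Check that the DNS service exists and has a cluster IP
oc get service -n openshift-dns

# Test DNS from inside the cluster
# (getent is used because the UBI base image does not include nslookup)
oc run -it --rm debug-dns --image=registry.access.redhat.com/ubi9/ubi:latest --restart=Never -- getent hosts kubernetes.default.svc.cluster.local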
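
On clusters with many nodes, scanning the pod list by eye is error-prone. The sketch below lists only the DNS pods that are not fully ready. It assumes the default `oc get pods` column order (NAME, READY, STATUS); the function name `dns_pods_not_ready` is hypothetical.

```shell
# Print the names of openshift-dns pods that are not Running
# or whose ready container count is below the total (e.g. 1/2).
dns_pods_not_ready() {
    oc get pods -n openshift-dns --no-headers |
        awk '{
            split($2, ready, "/")   # READY column, e.g. "2/2"
            if ($3 != "Running" || ready[1] != ready[2]) print $1
        }'
}
```

Empty output means every pod in the namespace is Running with all containers ready. Any names printed are good candidates for `oc logs` and `oc describe pod`.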

Solution

1. Fix DNS Server Issues

If DNS servers cannot resolve queries due to server-side problems:

On the DNS server infrastructure (not cluster nodes):

# Restart DNS service (depends on your DNS server - examples):
# For BIND:
systemctl restart named

# For dnsmasq:
systemctl restart dnsmasq

# For systemd-resolved:
systemctl restart systemd-resolved

# Check DNS server logs
journalctl -u named -f    # For BIND
journalctl -u dnsmasq -f  # For dnsmasq

Verify DNS server is resolving queries:

# Test resolution locally on DNS server
dig @localhost google.com

# Check DNS server is listening on port 53
# (ss replaces netstat, which is often not installed on current RHEL releases)
ss -tulpn | grep ':53 '

2. Fix Network Connectivity

If DNS servers are unreachable due to network issues:

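# Check firewall rules allow DNS traffic
# On RHEL nodes that run firewalld (RHCOS nodes typically do not run firewalld,
# in which case the commands below report that the service is not found):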
oc debug node/<node-name>
chroot /host

# Check firewall status
systemctl status firewalld
firewall-cmd --list-all

# If needed, add firewall rule (not recommended - fix upstream firewall instead)
# firewall-cmd --permanent --add-service=dns
# firewall-cmd --reload

Better approach: Fix firewall rules on the network infrastructure (routers, firewalls) to allow:

  • UDP port 53 from cluster nodes to DNS servers
  • TCP port 53 from cluster nodes to DNS servers

3. Update DNS Configuration

If DNS server IPs are wrong or stale:

Update upstream DNS resolvers in DNS operator:

# Edit DNS operator configuration
oc edit dns.operator.openshift.io/cluster

# Add or update upstream resolvers:
spec:
  upstreamResolvers:
    upstreams:
    - type: Network
      address: 192.168.1.1  # Your correct DNS server IP
      port: 53
    - type: Network
      address: 8.8.8.8      # Backup DNS server
      port: 53
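
For automation, the same change can be applied without an interactive editor. The sketch below builds a JSON merge patch from a list of server IPs; the function name `build_upstream_patch` is hypothetical, and port 53 is assumed for every server. Note that a merge patch replaces the whole `upstreams` list, so pass every server you want to keep.

```shell
# Print a JSON merge patch that sets spec.upstreamResolvers.upstreams
# to the given DNS server addresses (type Network, port 53).
build_upstream_patch() {
    entries=""
    for addr in "$@"; do
        # Prefix a comma before every entry except the first.
        entries="${entries:+$entries,}{\"type\":\"Network\",\"address\":\"$addr\",\"port\":53}"
    done
    printf '{"spec":{"upstreamResolvers":{"upstreams":[%s]}}}\n' "$entries"
}

# Usage (not executed here):
#   oc patch dns.operator.openshift.io/cluster --type=merge \
#       -p "$(build_upstream_patch 192.168.1.1 8.8.8.8)"
```

Reviewing the printed patch before passing it to `oc patch` is a cheap way to catch a mistyped address.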

Update /etc/resolv.conf via MachineConfig (if needed):

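apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-dns
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/resolv.conf
        mode: 0644
        overwrite: true
        contents:
          # Ignition takes file contents as a data URL ("inline" is Butane-only syntax).
          # Decoded contents:
          #   search cluster.local
          #   nameserver 192.168.1.1
          #   nameserver 8.8.8.8
          source: data:,search%20cluster.local%0Anameserver%20192.168.1.1%0Anameserver%208.8.8.8%0A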

Warning: Only use MachineConfig for DNS changes if absolutely necessary, as it will reboot nodes. Prefer configuring upstream resolvers in the DNS operator.

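
Ignition file entries usually carry their contents as a data URL in `contents.source`. If you do need a MachineConfig, the sketch below generates a base64 data URL for a resolv.conf containing only nameserver lines. The function name `resolv_conf_data_url` is hypothetical, and any `search` line you need would have to be added separately.

```shell
# Print a base64 data URL for a resolv.conf with one
# "nameserver <ip>" line per argument.
resolv_conf_data_url() {
    encoded=$(
        for ns in "$@"; do
            printf 'nameserver %s\n' "$ns"
        done | base64 | tr -d '\n'   # strip line wrapping added by base64
    )
    printf 'data:text/plain;charset=utf-8;base64,%s\n' "$encoded"
}

# Usage (not executed here):
#   resolv_conf_data_url 192.168.1.1 8.8.8.8
# Paste the printed value into contents.source of the MachineConfig file entry.
```

Decoding the payload with `base64 -d` before applying the MachineConfig confirms the file contents are what you expect.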
4. Restart CoreDNS

If internal cluster DNS is the issue:

Restart CoreDNS in OpenShift:

# Restart DNS pods
oc delete pods -n openshift-dns --all

# Wait for new pods to start
oc get pods -n openshift-dns -w

5. Verify Fix

After applying fixes, verify DNS is working:

# Re-run the in-cluster-checks rule
in-cluster-checks --debug-rule verify_dns_reachability

# Or test manually from a node
oc debug node/<node-name>
chroot /host
dig +short @<dns-server-ip> google.com
nslookup google.com
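
To repeat the manual test on every node without opening an interactive debug shell each time, the check can be looped over the node list. This is a sketch under assumptions, not the in-cluster-checks implementation: the function name `nodes_failing_dns_server` is hypothetical, and it assumes `oc debug` writes its own status messages to standard error so that only dig output reaches the filter.

```shell
# Print every node (as node/<name>) from which the given DNS server
# fails to resolve the query name.
#   $1: DNS server IP   $2: name to resolve (defaults to google.com)
nodes_failing_dns_server() {
    dns_server="$1"
    query_name="${2:-google.com}"
    for node in $(oc get nodes -o name); do
        # </dev/null keeps oc debug from reading the caller's stdin.
        answer=$(oc debug "$node" -- chroot /host \
            dig +short +time=2 +tries=1 "@$dns_server" "$query_name" \
            </dev/null 2>/dev/null | grep -v '^;' || true)
        [ -n "$answer" ] || echo "$node"
    done
}
```

`nodes_failing_dns_server 192.168.1.1` prints nothing when the server answers from every node. Each iteration starts a debug pod, so expect the loop to take several seconds per node.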

Resources

  • [OpenShift DNS Operator Documentation](https://docs.openshift.com/container-platform/latest/networking/dns-operator.html)
  • [Configuring DNS forwarding in OpenShift](https://docs.openshift.com/container-platform/latest/networking/dns-operator.html#nw-dns-forward_dns-operator)
  • [Troubleshooting DNS in OpenShift](https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-network-issues.html#nw-troubleshoot-dns_troubleshooting-network-issues)
  • [Red Hat KB: Debugging DNS resolution](https://access.redhat.com/solutions/3804501)
