[CNI] Worker nodes is fetching only first DNS server IP from custom DNS list #713

Closed
vakalapa opened this issue Oct 29, 2020 · 8 comments · Fixed by #709

Comments

@vakalapa
Contributor

vakalapa commented Oct 29, 2020

What happened:
There are two kinds of custom DNS server settings in Azure:

  1. From the Azure VNET custom DNS servers list. This shows up as the Global DNS name servers in systemd-resolve output, as below, and CNI is NOT reading this information.
Global
     DNS Servers: 168.63.129.16
                  8.8.8.8
                  1.1.1.1
     DNS Domain: reddog.microsoft.com
     DNSSEC NTA: 10.in-addr.arpa 
          .....
  2. From the Azure VM network interface custom DNS servers list. This shows up under eth0 in systemd-resolve output. CNI does read this information, but it expects a different format from what systemd-resolve prints.
Link 2 (eth0)
       Current Scopes: none
       LLMNR setting: yes
       MulticastDNS setting: no
       DNSSEC setting: no
       DNSSEC supported: no
       DNS Servers: 168.63.129.16
                    8.8.8.8
                    1.1.1.1
        DNS Domain: reddog.microsoft.com

CNI is expecting the DNS servers to be printed as below:

DNS Servers:   168.63.129.16
DNS Servers:   8.8.8.8
DNS Servers:   1.1.1.1

This section needs to be updated:

for _, line := range lineArr {
	if strings.Contains(line, dnsServersStr) {
		dnsServerSplit := strings.Split(line, colonDelimiter)
		if len(dnsServerSplit) > 1 {
			dnsServerSplit[1] = strings.TrimSpace(dnsServerSplit[1])
			dnsInfo.Servers = append(dnsInfo.Servers, dnsServerSplit[1])
		}
	} else if strings.Contains(line, dnsDomainStr) {
		dnsDomainSplit := strings.Split(line, colonDelimiter)
		if len(dnsDomainSplit) > 1 {
			dnsInfo.Suffix = strings.TrimSpace(dnsDomainSplit[1])
		}
	}
}

What you expected to happen:
Azure CNI should read both the Global DNS servers list and the eth0 DNS servers list, and configure azure0 with them.

How to reproduce it:

  1. Go to the worker node's Azure VNET -> Settings -> DNS Servers:
    Edit the option from "Default (Azure-provided)" to "Custom" and add additional DNS servers.
    The worker node needs to be restarted for this change to be applied.

    After the reboot, "systemd-resolve --status|grep 'DNS Servers' -A4 -B4" will show that only the first DNS server from the change above is applied.

Orchestrator and Version (e.g. Kubernetes, Docker):
Kubernetes

Operating System (Linux/Windows):
Linux

Kernel (e.g. uname -a for Linux or $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion for Windows):
v5.4

@vakalapa vakalapa linked a pull request Oct 30, 2020 that will close this issue
@Kenneth-Abrams

Can we get a timeline on this issue being fixed? I believe this is a result of a case I opened with Microsoft. We were severely impacted by this major bug when our domain controller went offline. All our clusters crashed because the worker nodes were not respecting all DNS servers configured in the VNET.

INC: 120101621001721

@Kenneth-Abrams

@vakalapa Do I need to restart my worker nodes for the change to take effect?

@vakalapa
Contributor Author

@Kenneth-Abrams, we are currently discussing the timeline and will update soon on the release date of the new v1.2.0 Azure CNI plugin, which will have "transparent" mode as the default. In this mode, DNS servers will be updated as expected.

If you want an immediate mitigation, you can work around this problem by using the Azure CNI plugin with Calico policy.
Azure CNI + Calico Policy

@Kenneth-Abrams

Kenneth-Abrams commented Nov 5, 2020

@vakalapa
@brendandburns
Can you fast-track this? It is a major issue for AKS that will cause production impact to customers if that single DNS server goes down. Currently, all my AKS clusters using CNI depend on a single domain controller because of this bug. The recommended fix should not be to use CNI with Calico; as a customer, Calico is optional to me.

@Kenneth-Abrams

Any word on when this is going to production? I just checked this evening and my clusters are still seeing 1 of 4 configured DNS servers from my VNET.

@vakalapa
Contributor Author

vakalapa commented Jan 4, 2021

@Kenneth-Abrams Just before the holidays, a newer version of CNI was released to production. In a new cluster, you should be able to see these DNS servers work. Let me know if you see any issues in newer clusters.

@Kenneth-Abrams

@vakalapa What about existing clusters? I shouldn't have to rebuild my environment for something introduced by a MS bug.

@paulgmiller
Member

Existing clusters get new CNI versions when they upgrade (node image or k8s version upgrade).
