
The cluster-internal DNS server cannot be used from Windows containers #2027

Open
chweidling opened this Issue Jan 10, 2018 · 228 comments


chweidling commented Jan 10, 2018

Is this a request for help?: NO


Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE


What version of acs-engine?: canary, GitCommit 8fd4ac4


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes 1.8.6

What happened:

I deployed a simple cluster with one master node and two Windows nodes. In this deployment, requests to the cluster's own DNS server (kube-dns) time out, while requests to external DNS servers work.

Remark: This issue is somehow related to #558 and #1949. The related issues suggest that the DNS problems are related to the Windows dnscache service or to the custom VNET feature. But the following description points in a different direction.

What you expected to happen: Requests to the internal DNS server should not time out.

Steps to reproduce:

Deploy a simple kubernetes cluster with one master node and two Windows nodes with the following api model:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "kubernetesConfig": {
        "networkPolicy": "none"
      },
      "orchestratorRelease": "1.8"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "---",
      "vmSize": "Standard_D4s_v3"
    },
    "agentPoolProfiles": [
      {
        "name": "backend",
        "count": 2,
        "osType": "Windows",
        "vmSize": "Standard_D4s_v3",
        "availabilityProfile": "AvailabilitySet"
      }      
    ],
    "windowsProfile": {
      "adminUsername": "---",
      "adminPassword": "---"
    },
    "linuxProfile": {
      "adminUsername": "weidling",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa ---"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "---",
      "secret": "---"
    }
  }
}

Then run a Windows container. I used the following command: kubectl run mycore --image microsoft/windowsservercore:1709 -it powershell

Then run the following nslookup session, where you try to resolve a DNS entry with the default (internal) DNS server and then with Google's DNS server:

PS C:\> nslookup
DNS request timed out.
    timeout was 2 seconds.
Default Server:  UnKnown
Address:  10.0.0.10

> github.com
Server:  UnKnown
Address:  10.0.0.10

DNS request timed out.
    timeout was 2 seconds. 
(repeats 3 more times)
*** Request to UnKnown timed-out

> server 8.8.8.8
DNS request timed out.
    timeout was 2 seconds.
Default Server:  [8.8.8.8]
Address:  8.8.8.8

> github.com
Server:  [8.8.8.8]
Address:  8.8.8.8

Non-authoritative answer:
Name:    github.com
Addresses:  192.30.253.113
          192.30.253.112

> exit

Anything else we need to know:
As suggested in #558, the problem should vanish 15 minutes after a pod has started. In my deployment, the problem does not disappear even after one hour.

I observed this behavior independently of the values of the networkPolicy (none, azure) and orchestratorRelease (1.7, 1.8, 1.9) properties in the api model. With the model above, I get the following network configuration inside the Windows pod:

PS C:\> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : mycore-96fdd75dc-8g5kd
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No

Ethernet adapter vEthernet (9519cc22abb5ef39c786c5fbdce98c6a23be5ff1dced650ed9e338509db1eb35_l2bridge):

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #3
   Physical Address. . . . . . . . . : 00-15-5D-87-0F-CC
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::a58c:aaf:c12b:d82c%21(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.244.2.92(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.240.0.1
   DNS Servers . . . . . . . . . . . : 10.0.0.10
   NetBIOS over Tcpip. . . . . . . . : Disabled

brunsgaard commented Jan 10, 2018

@chweidling I will just let you know that you are not alone. My team and I have been battling this all day with no luck at all. I think @JiangtianLi is looking into it (or at least into similar issues). A quick search through the issues shows that there are multiple problems with Windows DNS and networking right now.


ITler commented Jan 11, 2018

I face an issue that sounds similar. I'm on AzureCloudGermany. However, I have trouble with DNS resolution in Linux-based (Ubuntu, Debian, Alpine) containers, but only in a multi-agent cluster. With only one k8s agent node, this does not seem to be a problem.
Should I open a separate GitHub issue for that, since this one refers to Windows containers?


cpunella commented Jan 11, 2018

Hi,

we are facing the same issue described by @chweidling. We have a hybrid cluster with both Linux and Windows nodes, and only the Windows nodes suffer from this problem.

@ITler yes, it seems that your issue is different... maybe it is better to open a new issue ;)


4c74356b41 commented Jan 11, 2018

I can confirm I'm seeing the exact same behavior: DNS doesn't work in Kubernetes containers (if I create a container on the node using Docker, it works).

JiangtianLi (Contributor) commented Jan 11, 2018

@ITler Is your multi-agent cluster Linux only or hybrid? If it is Linux only, please file a different issue.


Josefczak commented Jan 12, 2018

Maybe this helps in diagnosing the issue:
I was able to get the pods working by changing the DNS entry from the ClusterIP to one of the DNS pod IPs:
netsh interface ip show config
netsh interface ip set dns "****" static 10.244.0.3


ghost commented Jan 17, 2018

Nice catch @Josefczak!
On our side, we also added the DNS suffix to let Windows containers resolve short service names, thanks to these PowerShell commands:

$adapter=Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses 10.244.0.2,10.244.0.3
Set-DnsClient -InterfaceIndex $adapter.ifIndex -ConnectionSpecificSuffix "default.svc.cluster.local"

rbankole commented Feb 1, 2018

Josefczak thanks!


esheris commented Feb 3, 2018

I don't have much to add here; I just came across this after much searching. I have the same timeout issue connecting to 10.0.0.10 using nslookup. While setting the container's DNS is a solution, having to muck about with my container entrypoint to work around this issue doesn't seem like the greatest approach. Fortunately we are still in an early testing phase. Is there a bug tracked somewhere for this specific issue?


4c74356b41 commented Feb 3, 2018

@esheris I guess you are looking at it


brobichaud commented Feb 6, 2018

Oh man, this solution @Josefczak found is what I've been looking for, literally, for 2 months. :-) I can get it to work if I manually connect to my pods, but I am struggling with the dockerfile commands to automate it. Can anyone offer nanoserver dockerfile commands that work? (i.e. no PowerShell required!)



brobichaud commented Feb 6, 2018

Yeah, I did see that in the thread above, but the problem is that it requires the interface name, which appears to be unique to the pod. Surely someone has already automated this in a dockerfile. This is a HUGE fix for a longstanding DNS issue in 1709 for me.


esheris commented Feb 6, 2018

The only solution I could come up with is to change my container's entrypoint to a PowerShell script that runs the above commands and then launches what I really want to run. In my case I had some other things to do with my web.config anyway, so my dockerfile now looks like this:

FROM microsoft/aspnet:4.7.1-windowsservercore-1709
COPY entrypoint.ps1 .
...
ENTRYPOINT [ "powershell.exe", "c:\\entrypoint.ps1"]

entrypoint.ps1 essentially looks like this:

$adapter=Get-NetAdapter
Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses 10.244.0.2,10.244.0.3
Set-DnsClient -InterfaceIndex $adapter.ifIndex -ConnectionSpecificSuffix "default.svc.cluster.local"
... web.config update ...
c:\ServiceMonitor.exe w3svc

brobichaud commented Feb 6, 2018

@esheris @JiangtianLi I was able to come up with the PowerShell commands for servercore much like you have (though I put them inline in the dockerfile), but when I deploy my pod the DNS server hasn't changed. I suspect a permissions problem in the dockerfile: it's like it runs the commands but they fail to apply. I can still remote into my pod, manually issue the same commands, and then my already running app suddenly starts working. Here is the relevant snippet from my dockerfile:

SHELL ["powershell", "-command"]
RUN "$adapter=Get-NetAdapter; \
	Set-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -ServerAddresses 10.244.0.2,10.244.0.3;"

Does anyone know how to correctly elevate permissions in the dockerfile for Windows?


esheris commented Feb 6, 2018

You can't really do this in the dockerfile directly, because the container's underlying NIC will change and you are setting the DNS based on it. This is why I had to modify my container entrypoint: you have to set DNS when the container starts.


brobichaud commented Feb 6, 2018

Ahhh, I see. That does explain why it failed to work in my dockerfile. Ugh, your workaround is ingenious but so ugly and feels so hacky. Alas it DOES work, and I thank you @esheris!

A couple of questions maybe you can answer:

  1. I see you start your entrypoint with ServiceMonitor.exe. If I'm running a .NET Core app, can I just run my executable, or do I also need to somehow use ServiceMonitor? (It works just running my exe, but I'm concerned I'm missing out on some pertinent feature by not using ServiceMonitor.exe.)
  2. Have you had any luck with this same technique on nanoserver? I'm struggling to find the right commands to do this without PowerShell. The netsh example earlier in this thread requires the interface name, which is dynamic in the pod.

esheris commented Feb 6, 2018

I certainly agree that it feels hacky; I expressed as much in my original post. Sorry, my app is a .NET 4.7 app with some WebForms stuff in it, so we can't run nanoserver/.NET Core, and I'm not really sure how to answer your questions. I just got the default entrypoint of one of my older images (docker inspect imageguid) and tacked it onto the bottom of my entrypoint script.

I just pulled microsoft/nanoserver:latest and launched it (docker run -it microsoft/nanoserver:latest powershell), and Get-NetAdapter and Set-DnsClientServerAddress/Set-DnsClient seem to be there.


brobichaud commented Feb 6, 2018

Unfortunately nanoserver:latest is Server 1607, and I really need Server 1709 (yeah, weird decision on Microsoft's part). Server 1709 removed PowerShell support. :-( I'll continue iterating on it and post a response here if I come up with a solution for nanoserver 1709. Or I may resort to using the new PowerShell Core in nanoserver 1709.


esheris commented Feb 6, 2018

You could assume the NIC name, which should always be the same, and set it with netsh that way:

netsh interface ip set dns "Ethernet" static <dnsip>

Run/exec into your container and validate its name first; "Ethernet" was what I had in my previously mentioned nanoserver container.


brobichaud commented Feb 6, 2018

If I open a new nanoserver container locally, the name is always "Ethernet", but the interface name appears to be dynamic in an ACS k8s pod. For example, mine is now:

vEthernet (beb30eddfc08797307915783cb1c32039566d8f9ac7911334cbebd8dd0e366a2_l2bridge)

But to prove this can even work with netsh, I opened a command prompt in my pod and tried to do it manually; the result is:

The requested operation requires elevation (Run as administrator)

Do you know how to elevate a command prompt in a container/pod?


brobichaud commented Feb 7, 2018

Argh. Roadblocked here with nanoserver. The elevation issue has prevented me from pursuing the netsh approach; I cannot find anything on how to elevate to admin in a nanoserver command prompt.

So then I thought I'd explore the PowerShell Core path with nanoserver, since I've got a script that works on servercore. Alas, PowerShell Core does not support Set-DnsClientServerAddress, I suspect because that cmdlet is very Windows-specific and Core is designed to be cross-platform.

Dead end. I can of course migrate my .NET Core app to run on servercore, which I don't really want, as it feels like a step backwards. And it means automating the install of .NET Core, since there is no pre-built servercore image with .NET Core.

I gotta say, nanoserver is easy to love and yet even easier to hate. :-(



brobichaud commented Feb 7, 2018

A good suggestion @esheris on the idea of running an entrypoint script. Alas, I tried it and see the same error about elevation being required. It feels like I am so close: I have discovered that the interface index is consistently 30, so if I had permissions I could use this command to set the DNS server:

netsh interface ip set dns 30 static 10.244.0.3

As for runas, it does not exist in nanoserver. Blocked by nanoserver at every path, it feels! I may have to step back and move my nanoserver use to servercore until Microsoft gets this fixed. Sooo not what I want to do; I really want to get some legs on nanoserver as we are building up this greenfield app, not migrate it later and see what breaks all at once! :-(


msorby commented Feb 8, 2018

This one is of interest: #2230

In the meantime, I have a workaround to find the current IP addresses of the kube-dns pods. I'm running servercore, so I can use Set-DnsClientServerAddress.
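For reference, one way to look up the current kube-dns pod IPs (the addresses people above plug into Set-DnsClientServerAddress or netsh) is a kubectl query. This is a sketch that assumes kubectl access to the cluster and the usual k8s-app=kube-dns label; adjust the selector for your setup:

```shell
# List the pod IPs of the kube-dns pods in kube-system.
# Assumes the standard "k8s-app=kube-dns" label; requires a live cluster.
kubectl get pods -n kube-system -l k8s-app=kube-dns \
  -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}'
```

Note these IPs can change when the kube-dns pods are rescheduled, which is why hardcoding them in an image is fragile.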


4c74356b41 commented Feb 8, 2018

Just in case anyone runs into the same issue: for me the workaround didn't work unless I added Start-Sleep 10 (it will probably work with less).


brobichaud commented Feb 8, 2018

I looked through #2230 and it does look interesting, but it's not clear to me that it addresses this issue. Clearly there are other DNS issues in Windows 1709 itself, but I wonder whether the problem we are seeing is in fact Windows or the way k8s is set up with Windows nodes.

This just feels like such a huge roadblock of an issue that it should be of the highest priority to fix.


msorby commented Feb 8, 2018

Yeah, it is a huge blocker. I actually pulled down the pull request, merged in the latest changes from acs-engine\master, and built it. Still no DNS resolution from 1709 containers...


4c74356b41 commented Jun 22, 2018

@SteveCurran technically AKS has nothing to do with this (even though it uses acs-engine behind the scenes). But overall, yeah. This is a trainwreck. And it's getting worse.


sam-cogan commented Jun 25, 2018

It does seem to be getting worse, and inconsistently so. I have a Windows init container that works fine calling out to the internet, yet the main container it spawns can't resolve anything. It makes no sense.


digeler commented Jul 15, 2018

Any updates on this?


sam-cogan commented Jul 15, 2018

Since moving to acs-engine release 0.19 and using 1803 images, this seems to have improved significantly. DNS is resolving as expected and has stayed that way for some time.


SteveCurran commented Jul 16, 2018

I have moved to 0.19.3 and am still using 1709, and I am seeing much more stability. I am still pointing DNS at the individual DNS pods' addresses. The DNS server seems to get overwhelmed when pushing many deployments at once; if I push deployments slowly, then all is well.
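The "push deployments slowly" observation above can be approximated by pausing between applies. A rough sketch, assuming a live cluster; the manifest directory and the 30-second delay are made-up values to tune, not measurements:

```shell
# Apply deployment manifests one at a time, pausing between each so the
# cluster DNS/networking isn't hit with many pod creations at once.
for f in deploy/*.yaml; do   # hypothetical manifest directory
  kubectl apply -f "$f"
  sleep 30                   # arbitrary settle time; adjust as needed
done
```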


ocdi commented Jul 16, 2018

@SteveCurran how did you move to 0.19.3? I have a cluster that I would like to upgrade, and I am already running Kubernetes 1.11.0, so the upgrade command is a no-op as there is nothing to upgrade. 0.19.2 seems to have some networking changes that I am hoping will solve my current issues, but I am unsure how to actually do this. I could always drop and re-create a completely new cluster, but that is a lot of work, and the main thing I am unsure about is keeping the ingress LB IP. :-)


SteveCurran commented Jul 16, 2018

@ocdi I dropped and recreated.


atomaras commented Jul 17, 2018

I have the same problem; the cluster is unusable. Used acs-engine 0.19.1, Windows Server 1803, k8s version 1.11.0-rc.3.

PatrickLang (Member) commented Jul 17, 2018

This is fixed in acs-engine 0.19.2. If you're still hitting it in that version or later, can you share details? Otherwise, can we close this issue?


ocdi commented Jul 17, 2018

@PatrickLang This may be a silly question, but how do we upgrade an existing cluster to 0.19.2? I can see how to upgrade if I am changing the Kubernetes version; however, I used 0.19.1 to upgrade to k8s 1.11.0 already, and the upgrade is a no-op as I am already at the target version.


SteveCurran commented Jul 18, 2018

@PatrickLang does this require running 1803 on the host and in the container? When using a 1709 host and container, we still need to use the individual IP addresses of the DNS pods and not 10.0.0.10.


atomaras commented Jul 18, 2018

@PatrickLang it doesn't work. acs-engine 0.19.3, k8s 1.11.0-rc.3, Server 1803. Windows pods can't reach kube-dns.

jsturtevant (Collaborator) commented Jul 19, 2018

@atomaras could you try with acs-engine 0.19.5, 1.11.1, and Server 1803? I was able to successfully do DNS queries from Windows and Linux pods with those versions.


atomaras commented Jul 19, 2018

@jsturtevant Can I simply use acs-engine upgrade or do I have to recreate the cluster?

jsturtevant (Collaborator) commented Jul 19, 2018

I usually drop and recreate to make sure everything is deployed properly.


4c74356b41 commented Jul 20, 2018

@jsturtevant viable production approach


ocdi commented Jul 20, 2018

I upgraded an existing cluster from 1.11.0 to 1.11.1 with acs-engine 0.19.5 and 1803, and so far so good; the baseline CPU usage has dropped, which is nice, from a constant 15-20% to maybe 10%. Not sure what in the previous version was using so much CPU, but the more left for containers to run, the better. I haven't observed any DNS issues so far, but it's only been an hour. :-)


atomaras commented Jul 20, 2018

@jsturtevant I recreated the cluster with k8s 1.11.1, and some Windows containers work but others don't; I don't know why. Specifically, I ran a windowsservercore:1803 busybox-style image in the default namespace and DNS worked. Then I ran my Windows Jenkins agent image, based on .NET Framework 1803, inside the jenkins namespace, and it didn't work (same as before).

Some extra observations: 1) the aci-networking container still gets scheduled on Windows nodes and fails, so I have to patch the deployment, and 2) initially I tried upgrading the cluster, which resulted in only the master node becoming 1.11.1 while the other nodes remained at 1.11.0-rc.3, so I ended up recreating the cluster.

jsturtevant (Collaborator) commented Jul 20, 2018

@atomaras I believe the issue you're seeing is because the second pod is in a separate namespace. Could you exec into the pod in the jenkins namespace, run ipconfig /all, and post the output here? Can you connect to other pods when you use the fully qualified name?

Additionally, what happens when you run the Jenkins deployment in the default namespace?

@ocdi Thanks for the update. If you see pods drop network connectivity/DNS over time, drop a note here.

PatrickLang (Member) commented Jul 20, 2018

Yes, there's a problem where only the pod's own namespace is added to the DNS suffix resolution list.
kubernetes/kubernetes#65016 mentions this as well. We need a specific fix in azure-cni, so I'm checking to make sure a tracking issue is filed there.
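To illustrate why this bites cross-namespace lookups: a stub resolver expands a short name by appending each entry in the DNS suffix search list, so if only the pod's own namespace suffix is present, a bare service name in another namespace never yields a matching FQDN. A toy expansion (service and namespace names are made up):

```shell
# Candidate FQDNs a resolver would try for the short name "mysvc" when the
# pod sits in the "jenkins" namespace. With the full suffix list, the
# "svc.cluster.local" entry lets "mysvc.default"-style names resolve; with
# only the first entry present (the bug above), cross-namespace short names fail.
short=mysvc
for suffix in jenkins.svc.cluster.local svc.cluster.local cluster.local; do
  echo "$short.$suffix"
done
```

Until the azure-cni fix lands, using the fully qualified name (e.g. mysvc.default.svc.cluster.local) sidesteps the suffix list entirely.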


atomaras commented Jul 20, 2018

I narrowed it down to a specific node. I have 2 Windows nodes.
Node 31051k8s9001 works correctly:
[screenshot]

but node 31051k8s9000 (the one that used to fail DNS) now fails with:
[screenshot]

which is most likely tied to the DNS issue.

Please note that those nodes have barely any containers running on them.

PatrickLang (Member) commented Jul 21, 2018

Here's the issue for the incomplete DNS suffix list: Azure/azure-container-networking#206

PatrickLang (Member) commented Jul 21, 2018

@atomaras - the failure you highlighted above is due to IP address exhaustion ("Failed to allocate address: … No available addresses").

The error isn't being handled correctly due to Azure/azure-container-networking#195


atomaras commented Jul 21, 2018

Thank you @PatrickLang !
I'll be keeping an eye out for these.

@PatrickLang PatrickLang added this to Backlog in Windows Support Aug 3, 2018

@PatrickLang PatrickLang moved this from Backlog to Triage in Windows Support Aug 3, 2018

@PatrickLang PatrickLang moved this from Triage to In Progress in Windows Support Aug 3, 2018

@PatrickLang PatrickLang moved this from In Progress to Blocked Waiting on outside fixes in Windows Support Aug 22, 2018

@PatrickLang PatrickLang moved this from Blocked Waiting on outside fixes to In Progress in Windows Support Aug 24, 2018

@PatrickLang PatrickLang moved this from In Progress to Blocked Waiting on outside fixes in Windows Support Sep 15, 2018
