This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

k8s: DNS resolution problems on windows nodes #558

Closed
ajorkowski opened this issue Apr 30, 2017 · 35 comments

Comments

@ajorkowski

ajorkowski commented Apr 30, 2017

I'm using a fairly recent commit (#520 d852aba) of acs-engine with kubernetes 1.6.2, including the winnat commit. Unfortunately, I'm seeing some DNS resolution issues that are pretty intermittent (to be fair, I think they have been happening the whole time I've been working on this, but there have been some other blocking issues).

Everything works fairly well (I'm really enjoying the fast spin-up time), and then all of a sudden calls will start to time out in our backend. When I ssh into the box and run some powershell commands I get something like the following:

PS C:\install> nslookup idnttest.table.core.windows.net
Server:  kube-dns.kube-system.svc.cluster.local
Address:  10.0.0.10

Non-authoritative answer:
Name:    table.by3prdstr02a.store.core.windows.net
Address:  168.63.89.144
Aliases:  idnttest.table.core.windows.net

PS C:\install> curl idnttest.table.core.windows.net
curl : The remote name could not be resolved: 'idnttest.table.core.windows.net'
At line:1 char:1
+ curl idnttest.table.core.windows.net
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand

These commands were executed right after each other, so I don't understand why the name is not being resolved.

If I do an ipconfig /flushdns call, the curl command (and my app) works again. After a little while (an hour or two?) things start to fail again.

I can confirm that not every external service is failing; I am able to curl other urls without an issue while this is happening. And I haven't seen the problem happen with internal cluster dns names, although those are used less frequently in our test app.

Maybe I don't really understand how the DNS lookup is working here. Anything I can do to alleviate this issue or debug it further?

Here's some more info that might be relevant:

  1. I have a hybrid cloud with 2 windows nodes and 1 linux node
  2. I am using an ingress controller (nginx) to handle traffic from a single load balanced service
  3. All deployments also have a non-load balanced service attached to them (for the ingress to point to)
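
For reference on the "debug it further" question above, a minimal debugging sketch, assuming the DnsClient PowerShell module is available inside the container (the hostname is just the one from the example above):

# Query kube-dns directly, skipping the local client cache
Resolve-DnsName idnttest.table.core.windows.net -Server 10.0.0.10 -DnsOnly
# Inspect what the container has cached for that name
Get-DnsClientCache -Entry idnttest.table.core.windows.net
# PowerShell equivalent of ipconfig /flushdns
Clear-DnsClientCache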
@JiangtianLi
Contributor

@ajorkowski What is the output of ipconfig /all? Does ping idnttest.table.core.windows.net also fail because it can't resolve the name? #539 might be related if you have a service running on the Linux node.

@ajorkowski
Author

ajorkowski commented May 1, 2017

@JiangtianLi When I mention 'service' I am talking about the kubernetes services; we have some that point to windows deployments and some that point to linux deployments. I haven't seen any problems with connectivity between the windows <-> linux nodes (not confirmed though) or from the outside internet -> containers (confirmed). This seems to be a problem with accessing the outside internet from within the containers on windows nodes.

Later today I'll do some debugging and get the answers to your questions.

@skinny
Contributor

skinny commented May 1, 2017

I'm having this exact same issue. After a (random) while the DNS lookup fails and never fully comes back until an ipconfig /flushdns is done inside the container. I noticed that when I manually enter the container and perform an nslookup it succeeds most of the time, but a ping command to the same host results in a 'no such name' / host unknown error.

Also, the problem appears to be occurring with external hostnames most of the time (mailgun, Azure storage and SQL).

As a workaround for now I have added a scheduled task (every 30 mins) to flush the DNS cache of my app containers.

docker ps -f name=k8s_xxxx --format "{{.ID}}" | %{docker exec $_ ipconfig /flushdns}
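
For anyone wanting to replicate that, a rough sketch of registering it as a repeating scheduled task on the Windows node (the task name and the C:\k\flush-dns.ps1 path are placeholders; k8s_xxxx is the same placeholder filter as above):

# C:\k\flush-dns.ps1 -- same one-liner as above
docker ps -f name=k8s_xxxx --format "{{.ID}}" | %{ docker exec $_ ipconfig /flushdns }

# Register it to run every 30 minutes under SYSTEM
$action  = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-NoProfile -File C:\k\flush-dns.ps1'
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 30)
Register-ScheduledTask -TaskName 'FlushContainerDnsCache' -Action $action -Trigger $trigger -User 'SYSTEM'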

@ajorkowski
Author

Ok, here is another example (this time it is cdn.raygun.io and not the table services).

PS C:\install> nslookup cdn.raygun.io
Server:  kube-dns.kube-system.svc.cluster.local
Address:  10.0.0.10

Non-authoritative answer:
Name:    d1bs4b7zdgd8l3.cloudfront.net
Addresses:  2600:9000:202f:c000:17:62f0:2dc0:93a1
  2600:9000:202f:3a00:17:62f0:2dc0:93a1
  2600:9000:202f:8c00:17:62f0:2dc0:93a1
  2600:9000:202f:8800:17:62f0:2dc0:93a1
  2600:9000:202f:fa00:17:62f0:2dc0:93a1
  2600:9000:202f:1600:17:62f0:2dc0:93a1
  2600:9000:202f:2600:17:62f0:2dc0:93a1
  2600:9000:202f:7000:17:62f0:2dc0:93a1
  54.192.119.154
  54.192.119.122
  54.192.119.228
  54.192.119.243
  54.192.119.197
  54.192.119.159
  54.192.119.42
  54.192.119.138
Aliases:  cdn.raygun.io

PS C:\install> ping cdn.raygun.io
Ping request could not find host cdn.raygun.io. Please check the name and try again.
PS C:\install> curl cdn.raygun.io
curl : The remote name could not be resolved: 'cdn.raygun.io'
At line:1 char:1
+ curl cdn.raygun.io
+ ~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand

Here is the ipconfig /all:

Windows IP Configuration

   Host Name . . . . . . . . . . . . : 72cf1a6b47d5
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No
   DNS Suffix Search List. . . . . . : wxt5ky5hgd4uragneppffe5bod.dx.internal.cloudapp.net

Ethernet adapter vEthernet (Container NIC eabddebe):

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #8
   Physical Address. . . . . . . . . : 00-15-5D-03-A4-3C
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::74cf:965d:5578:e815%47(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.244.2.153(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . :
   DNS Servers . . . . . . . . . . . : 10.0.0.10
   NetBIOS over Tcpip. . . . . . . . : Disabled

Ethernet adapter vEthernet (Container NIC 6d7d739a):

   Connection-specific DNS Suffix  . : wxt5ky5hgd4uragneppffe5bod.dx.internal.cloudapp.net
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #9
   Physical Address. . . . . . . . . : 00-15-5D-17-55-EF
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::d5c2:28f7:96b7:4980%51(Preferred)
   IPv4 Address. . . . . . . . . . . : 192.168.217.14(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . : 192.168.208.1
   DNS Servers . . . . . . . . . . . : fec0:0:0:ffff::1%1
                                       fec0:0:0:ffff::2%1
                                       fec0:0:0:ffff::3%1
   NetBIOS over Tcpip. . . . . . . . : Disabled

And just for completeness, I can query other urls:

PS C:\install> nslookup idnttest.table.core.windows.net
Server:  kube-dns.kube-system.svc.cluster.local
Address:  10.0.0.10

Non-authoritative answer:
Name:    table.by3prdstr02a.store.core.windows.net
Address:  168.63.89.144
Aliases:  idnttest.table.core.windows.net

PS C:\install> ping idnttest.table.core.windows.net

Pinging table.by3prdstr02a.store.core.windows.net [168.63.89.144] with 32 bytes of data:
Request timed out.

Ping statistics for 168.63.89.144:
    Packets: Sent = 1, Received = 0, Lost = 1 (100% loss),
Control-C
PS C:\install> curl idnttest.table.core.windows.net
curl : <?xml version="1.0" encoding="utf-8" standalone="yes"?>
<error xmlns="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <code>InvalidUri</code>
  <message xml:lang="en-US">The requested URI does not represent any resource on the server.
RequestId:7da2e81b-0002-0016-62f3-c2490c000000
Time:2017-05-02T03:26:06.7325700Z</message>
</error>
At line:1 char:1
+ curl idnttest.table.core.windows.net
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand

@ajorkowski
Author

This is interesting: if I do an ipconfig /displaydns I get the following:

PS C:\install> ipconfig /displaydns

Windows IP Configuration

    rt.services.visualstudio.com
    ----------------------------------------
    Name does not exist.


    wpad
    ----------------------------------------
    Record data for type ALL could not be displayed.


    cdn.raygun.io
    ----------------------------------------
    Name does not exist.

It looks like there may be intermittent failures and they are being cached?
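
For what it's worth, a small sketch for pulling out just those cached negative entries, assuming the DnsClient PowerShell module is present in the container image (Status is 0 for successful lookups and non-zero, e.g. 9003 "name does not exist", for failures):

# List only negative entries currently stuck in the client-side cache
Get-DnsClientCache | Where-Object { $_.Status -ne 0 } | Select-Object Entry, Status, TimeToLive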

@JiangtianLi
Contributor

@ajorkowski The failure is intermittent. I could repro once and as @skinny said, after ipconfig /flushdns, the same name could be resolved. I'll find a way to repro more consistently and investigate.

@ajorkowski
Author

I have been using @skinny's fix for the last day now (i.e. using the scheduled task on the node to flush the dns in each container) and haven't had any problems so far. It is probably not ideal though...

@skinny
Contributor

skinny commented May 3, 2017

Yeah, I needed to shorten the interval to 10 minutes yesterday to avoid any issues, but it does the job until we get a proper fix.

I'm also going to experiment with the Windows DNS cache a bit to see if that helps.

@ajorkowski I've modified my container run script to include the following two lines. These disable all DNS caching inside the container, so I don't need to flush it anymore or even have a small window of failed lookups:

Set-Service dnscache -StartupType disabled
Stop-Service dnscache

& Application.Run.exe

Depending on the type of application, this might be useful for you too while we wait for a real fix
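
As a quick sanity check after those two lines run, something like this should show the service stopped and disabled (Status and StartType are standard properties on the Get-Service output in PowerShell 5+):

Get-Service dnscache | Select-Object Status, StartType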

@ajorkowski
Author

@skinny Awesome, I think for right now the 10 min flush is sufficient as we are really just running a test server at the moment, but it's good to know that there is an alternative that is even more reliable. Thanks for sharing.

@ajorkowski
Author

ajorkowski commented May 6, 2017

Just a bit of additional information for this thread - I created a v1.5.3 kubernetes cluster and I did not see this DNS issue. So it seems to be just affecting v1.6.2 from my own testing.

Update: after a day or two I did see this same issue occur in the v1.5.3 cluster.

@ajorkowski
Author

ajorkowski commented Sep 1, 2017

Just an update on this - I recently upgraded our cluster using 0.5.0 acs engine (1.7.2 kubernetes) and tried turning the DNS cache back on and was still getting this issue. Still no idea what is causing it - it is like the DNS temporarily fails and then the cache seems to cause that failure to 'stick'.

@humphs

humphs commented Sep 5, 2017

I found that setting the pod dnsPolicy to Default stopped my timeout issues. This is fine for me as I only need to resolve public DNS records.

If you need cluster name resolution, try disabling the negative DNS cache in your container. If you do get a DNS timeout, the bad record won't get stuck in your cache, and the name usually resolves correctly on the next attempt.

https://support.microsoft.com/en-gb/help/318803/how-to-disable-client-side-dns-caching-in-windows-xp-and-windows-serve
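
A minimal sketch of what the linked article describes, run inside the container (the registry path and the MaxNegativeCacheTtl value name are taken from that KB; it's an assumption that the same value applies on Windows Server 2016, and the service restart is needed for it to take effect):

# Stop negative (failed) DNS responses from being cached; successful lookups are still cached
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' `
    -Name 'MaxNegativeCacheTtl' -PropertyType DWord -Value 0 -Force
Restart-Service dnscache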

@skinny
Contributor

skinny commented Sep 5, 2017

@Dm3r Yes, disabling the DNS cache is what I have been using for a couple of months now (see my previous comments). However, I am still experiencing lookup failures which cause the container to fail to start completely. After a couple of restarts (random, up to 20 restarts) the lookup I need at application startup succeeds and the container starts fine.

So this issue is still present for me too unfortunately

@humphs

humphs commented Sep 5, 2017

I didn't explain it properly: you can disable caching for failed DNS queries but still cache successful ones. Might be useful in certain situations.

Are you using internal cluster lookups or do you just need to get out to the internet?

@skinny
Contributor

skinny commented Sep 5, 2017

I need both internal and external lookups. As the lookups are pretty quick I am just disabling all caching because that works at least for my use case.

Still the failing lookups are the real issue ;-)

@brobichaud

I too am experiencing this with clusters I built today in ACS via the Portal or CLI. I can repro this easily. The commands skinny added to his dockerfile to disable dnscache have worked for me as a temporary fix. I'd love to see this get some attention from Microsoft; I am willing to assist with data if needed.

@neilortoo

Is there any update on this issue? We're hitting exactly this problem where DNS seems to just stop working after a while and then that state sticks until the pod is restarted.

@brobichaud

I am still experiencing this issue and it's pretty frustrating that nothing has changed since April. I have k8s 1.7.9 clusters experiencing this. I have clusters with Server 2016 LTSC and clusters with Server 2016 v1709, both experiencing this.

I have found that turning off the dnscache service works well, but this is strongly not advised for production workloads. This technique also does not work with v1709 servers; I think there is a permission issue, as I am unable to disable the service at all on v1709 nanoservers.

@JiangtianLi
Contributor

@brobichaud As you already know, acs-engine v0.9.2 and later uses Windows 1709 instead of Windows Server 2016. Windows 1709 has a different DNS issue, but it has been fixed and the fix should be out in the next Windows update.

@JiangtianLi
Contributor

/cc @madhanrm

@brobichaud

@JiangtianLi is there any chance you could circle back around and update this thread once the v1709 update you mention is being used by new cluster creations? I'd be happy to try my scenario once that is available.

@JiangtianLi
Contributor

@brobichaud Sure, will do. Currently the DNS issue on 1709 is random (due to a race condition) and happens only in the first 15 minutes after the container starts up. If you still have a DNS issue after that, it is likely due to a different root cause, so please let us know.

@jwhousley

What do you recommend I do if I have a StartUp.ps1 that currently fails to make an HTTP request via Invoke-WebRequest? I have been fighting this since I created a cluster in ACS. Do I need to disable the DNS cache? If so, how?

@patrick-motard

patrick-motard commented Feb 22, 2018

@JiangtianLi
acs-engine: 0.11.0
nodes: 1709
kubernetes: 1.7.9

Disabling the cache in the windows container is not allowed:

PS C:\> Set-Service dnscache -StartupType disabled 
Set-Service : Service 'DNS Client (dnscache)' cannot be configured due to the following error: Access is denied 
At line:1 char:1
+ Set-Service dnscache -StartupType disabled
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : PermissionDenied: (System.ServiceProcess.ServiceController:ServiceController) [Set-Service], ServiceCommandException
    + FullyQualifiedErrorId : CouldNotSetService,Microsoft.PowerShell.Commands.SetServiceCommand

Deployed "simpleweb.yaml" service following these instructions:

apiVersion: v1
kind: Service
metadata:
  name: win-webserver
  labels:
    app: win-webserver
spec:
  ports:
    # the port that this service should serve on
  - port: 80
    targetPort: 80
  selector:
    app: win-webserver
  type: LoadBalancer
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      containers:
      - name: windowswebserver
        image: microsoft/windowsservercore:1709
        command:
        - powershell.exe
        - -command
        - "<#code used from https://gist.github.com/wagnerandrade/5424431#> ; $$listener = New-Object System.Net.HttpListener ; $$listener.Prefixes.Add('http://*:80/') ; $$listener.Start() ; $$callerCounts = @{} ; Write-Host('Listening at http://*:80/') ; while ($$listener.IsListening) { ;$$context = $$listener.GetContext() ;$$requestUrl = $$context.Request.Url ;$$clientIP = $$context.Request.RemoteEndPoint.Address ;$$response = $$context.Response ;Write-Host '' ;Write-Host('> {0}' -f $$requestUrl) ;  ;$$count = 1 ;$$k=$$callerCounts.Get_Item($$clientIP) ;if ($$k -ne $$null) { $$count += $$k } ;$$callerCounts.Set_Item($$clientIP, $$count) ;$$header='<html><body><H1>Windows Container Web Server</H1>' ;$$callerCountsString='' ;$$callerCounts.Keys | % { $$callerCountsString+='<p>IP {0} callerCount {1} ' -f $$_,$$callerCounts.Item($$_) } ;$$footer='</body></html>' ;$$content='{0}{1}{2}' -f $$header,$$callerCountsString,$$footer ;Write-Output $$content ;$$buffer = [System.Text.Encoding]::UTF8.GetBytes($$content) ;$$response.ContentLength64 = $$buffer.Length ;$$response.OutputStream.Write($$buffer, 0, $$buffer.Length) ;$$response.Close() ;$$responseStatus = $$response.StatusCode ;Write-Host('< {0}' -f $$responseStatus)  } ; "
      nodeSelector:
        beta.kubernetes.io/os: windows

The service & pod has been up for 21 minutes (Screenshot).

Not able to ping a server in the same vnet that I am able to ping from linux containers in the cluster:

PS C:\> ping 10.242.0.4

Pinging 10.242.0.4 with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.

Ping statistics for 10.242.0.4:
    Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),
PS C:\> nslookup 10.242.0.4
DNS request timed out.
    timeout was 2 seconds.
Server:  UnKnown
Address:  10.0.0.10

DNS request timed out.
    timeout was 2 seconds.
*** Request to UnKnown timed-out

Not able to curl the internal dns name of another service:

PS C:\> curl -UseBasicParsing http://MY_SERVICE_DOMAIN/MYROUTE
curl : The remote name could not be resolved: 'MY_SERVICE_DOMAIN'
At line:1 char:1
+ curl -UseBasicParsing http://MY_SERVICE_DOMAIN/MYROUTE
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand

@SteveCurran

This issue has become incredibly frustrating. I am using ACS 1.7.7 and can only occasionally resolve a remote URL even with dns caching turned off. This is preventing us from moving forward with ACS.

@JiangtianLi
Contributor

@SteveCurran Sorry for the inconvenience. There is an update in another thread: #2027. The next Windows/docker image update will mitigate the DNS issue in most scenarios.

@patrick-motard

@SteveCurran I was able to overcome this issue in a new cluster using the following:

acs-engine: 0.13.0
kubernetes: 1.8.4

A couple of things I learned when trying to update my cluster using these instructions: when I tried updating the 1.7/0.11.0 cluster to 1.8.4 using 0.13.0, it failed almost immediately, so the update process has changed between the two acs-engine versions and has introduced new fields somewhere in the engine's output folder. When I tried updating 1.7/0.11.0 to 1.8.4 using the 0.11.0 engine, it failed during the ARM update steps. I have yet to debug the output (it didn't give me a helpful error).

So at this point I'm looking at:

  1. Try to figure out why the update is breaking and fix it.
  2. If I can't fix the update, I'm going to have to create a new cluster with 1.8.4.

Side note/rant: My cluster needs to communicate with other external servers in the same vnet. These servers take a lot of time and effort to configure once they're provisioned (due to the legacy applications that need to run on the servers). Hybrid clusters using the acs-engine do not currently support deploying to an existing vnet. So I'm now on my third time deploying a new cluster, new vnet, and new legacy servers. I'm keeping my fingers crossed that this gets fixed before we need to update a production cluster because at this point it's going to require a complete rebuild of the vnet and all the resources within it.

@roycornelissen

roycornelissen commented Feb 28, 2018

@patrick-motard did you manage to mitigate the issue with a 1.8.4 cluster? I'm running a 1.8.8 cluster now, and one of my pods seems to have internet access (I'm using Azure AD amongst others), but another pod still cannot resolve the host name, so the problem isn't going away.

I have the same issues you have. I have to rely on an Azure admin in my company to deploy the cluster according to my acs-engine templates (rights issue), and I've had to bother them too many times already :(

@JiangtianLi Is there a way to obtain the new image with the fix already?

@patrick-motard

patrick-motard commented Feb 28, 2018

@roycornelissen after working on this issue today some more, I am still having this bug. On 1.7 I was not able to communicate with the internal k8s DNS from inside a windows container in the cluster. On 1.8.4 I can now resolve service DNS from within a windows container. I still cannot access the internet nor any IPs within the vnet (outside of the cluster) from a windows container in k8s 1.8.4.

@patrick-motard

Also @JiangtianLi, this issue does not seem to be the same as the 15 min race condition bug. My containers have been deployed for hours and are exhibiting this behavior.

@patrick-motard

@skinny's DNS-flushing quick fix does not fix this issue for me either.

@jdinard

jdinard commented Apr 9, 2018

I'm having the same issue as @patrick-motard. Are there any updates to this?

@daschott

daschott commented Jul 20, 2018

@jdinard @patrick-motard There have been a lot of networking fixes as of acs-engine 0.19.2 and 0.19.3 -- do you still see this issue there (on Windows 1803)?

For Windows 1709, there is a list of known DNS issues outlined here.

@PatrickLang
Contributor

No new responses here - closing. If you hit this problem on a newer deployment, please open a new issue.

@sylus
Contributor

sylus commented Nov 2, 2018

Hitting this problem on a new deployment, will file a new issue.
