K8s Windows 1803 pod creation fails with "HNS failed with error : Element not found" #853
Comments
@jhowardmsft @madhanrm @dineshgovindasamy Guys, could it be related to HNS along with docker 1.13 as well? |
Please collect the logs using https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/debug/collectlogs.ps1 I suspect this to be the out of sync issue - ie Azure CNI going out of sync with HNS. |
OK, ran the diagnostics and gathered azure-vnet.log on one of the affected nodes. They are attached. |
BTW, on the sig-windows slack channel someone else has seen the same problem. He said the following:
I haven't tried that as I don't know how to reset the HNS network. |
@sharmasushant can you add code in Azure CNI, to check if the network exists, before creating an endpoint. If not, can you re-create the network. On host restart or Kubelet restart kubeletstart.ps1 would attempt to cleanup the HNS Network and remove the json files. If the json files stick around, then azure cni might think that the network is available, and might try creating an endpoint using it, and ending up with this issue. |
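The out-of-sync failure mode described above can be sketched as follows. This is a minimal Python simulation, not the actual Azure CNI code (which is written in Go); `FakeHNS`, `create_endpoint_with_check`, and the state layout are invented for illustration. It models a plugin whose persisted JSON state claims a network exists, while HNS has lost it, and shows the proposed fix: check for the network (and re-create it) before creating an endpoint.

```python
class FakeHNS:
    """Stand-in for the Host Networking Service (illustrative only)."""
    def __init__(self):
        self.networks = {}

    def create_network(self, name):
        self.networks[name] = {"endpoints": []}

    def create_endpoint(self, network, endpoint):
        if network not in self.networks:
            # This is the "Element not found" failure mode from the issue title.
            raise LookupError("HNS failed with error : Element not found")
        self.networks[network]["endpoints"].append(endpoint)

def create_endpoint_with_check(hns, state, endpoint):
    """Proposed fix: verify the network exists in HNS before using it."""
    network = state["network"]        # what the stale persisted JSON claims exists
    if network not in hns.networks:   # out of sync: HNS no longer has the network
        hns.create_network(network)   # re-create it before adding the endpoint
    hns.create_endpoint(network, endpoint)

hns = FakeHNS()                       # fresh HNS: no networks at all
state = {"network": "azure"}          # contents of a stale azure-vnet.json
# Without the existence check, create_endpoint would raise LookupError here.
create_endpoint_with_check(hns, state, "pod-a")
print(hns.networks["azure"]["endpoints"])  # ['pod-a']
```

The trade-off discussed below (an extra lookup on every endpoint creation) corresponds to the `network not in hns.networks` check on the hot path.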
I don't see the error described in this issue in azure-vnet.log. From the diag logs:
Name  Type      ID                                    AddressPrefix
ext   L2Bridge  24a89481-38bb-4bf2-b6c0-52fdd0019708 |
@madhanrm @saiyan86 can you please take up this change? Let's check for network existence while creating an endpoint. @madhanrm please note that this adds overhead to endpoint creation, which we would like to avoid. |
If needed I can repro the problem on this node by removing the taint. Separately, is there a way I can manually fix this problem so I can repair my cluster? |
Try stopping the kubelet service, check whether any azure-vnet.json files exist, delete them if so, and then start the kubelet service. |
@sharmasushant Going forward with RS5 & using the v2 HNS RPC apis, it is mandatory to Open the network (to make sure it exists) & Create endpoint on that network. Just making sure the network exists is not going to cost much for v1 apis |
@madhanrm what's the correct way to stop the kubelet? using "sc stop kubelet" results in the error:
|
Stop-Service kubelet -Force
<del azure-vnet.json file, if exist>
Start-Service kubelet |
OK, that let me stop and start the kubelet. But after doing that I am now seeing this error when I try to schedule a pod:
Also, I do not see a new azure-vnet.json being created in c:\k. But I do see an azure-vnet.json.lock. Here is azure-vnet.log: |
Can you try deleting the 2 files below while kubelet is stopped? <del azure-vnet.json file, if exist> |
Please delete the lock files as well. |
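The manual fix above (stop kubelet, delete the azure-vnet state and lock files from C:\k, start kubelet) can be partially automated. This is a hypothetical helper, not a tool from the thread; it covers only the file-deletion step and assumes the stale files match azure-vnet*.json and azure-vnet*.lock, as described in the comments above.

```python
from pathlib import Path

def clean_stale_cni_state(k_dir):
    """Delete stale Azure CNI state/lock files; return the deleted names."""
    removed = []
    for pattern in ("azure-vnet*.json", "azure-vnet*.lock"):
        for f in Path(k_dir).glob(pattern):
            f.unlink()                # remove stale state so the CNI re-syncs with HNS
            removed.append(f.name)
    return sorted(removed)

# Demo against a throwaway directory standing in for C:\k
import tempfile
demo = Path(tempfile.mkdtemp())
for name in ("azure-vnet.json", "azure-vnet.json.lock", "config.json"):
    (demo / name).touch()
print(clean_stale_cni_state(demo))  # ['azure-vnet.json', 'azure-vnet.json.lock']
```

Run this only while the kubelet service is stopped, as the thread describes; deleting the files while the CNI plugin is active could race with an in-flight write.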
Ahhh, that did the trick nicely. Thank you so much for saving my cluster. :-) Is there anything I have done to trigger this? I'm wondering why others have not seen it. And more importantly, is there anything I can do to avoid this? I know how to fix it now, but it's surely a pain. |
@madhanrm you are a life saver!!! |
My Windows node crashed again just now, after I created a new pod. All pods on that node went into an error state. Some logs here. I am using acs-engine 0.23.1, k8s 1.12.1 |
@zhiweiv can you also provide the kubelet.err.log ? |
I built a new cluster, and the issue occurred again. It is very much like the repro steps in #3920. Thanks |
Hi @zhiweiv,
|
@zhiweiv I have a question. Do you have a resource limit on the containers? |
Thanks @qmalexander , the previous cluster was deleted, I will try the solution if it occurs again. |
@zhiweiv deleting the azure-vnet*[.lock|.json] in C:\k directory didn't help you? Which Azure CNI version is this? |
@daschott From the symptoms I've seen several times before: when there are about 10 or more pods on a node, the issue occurs at unpredictable times. Now I carefully keep fewer than 6 pods on each node, and there has been no new issue since then. I plan to try more than 10 pods on one node with k8s/Windows Server 2019 when they are available. |
So I am getting random Windows node failures and had to follow the steps that @qmalexander gave to recover the node. Does anyone have any idea what is going on? It is starting to get really concerning that a node can just randomly fail and become NotReady. Thinking I should file a critical issue against this for Azure. Steps: https://github.com/Azure/acs-engine/issues/4046#issuecomment-436539771 Small example of my logs when things fail:
belet.go:442: Failed to list *v1.Service: Get https://172.20.205.4:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.20.205.4:443: connectex: A socket operation was attempted to an unreachable network.
E1121 17:00:25.439784 6156 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://172.20.205.4:443/api/v1/pods?fieldSelector=spec.nodeName%3D36274k8s9010&limit=500&resourceVersion=0: dial tcp 172.20.205.4:443: connectex: A socket operation was attempted to an unreachable network.
E1121 17:00:25.460738 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:25.562526 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:25.662861 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:25.763012 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:25.774939 6156 event.go:212] Unable to write event: 'Post https://172.20.205.4:443/api/v1/namespaces/default/events: dial tcp 172.20.205.4:443: connectex: A socket operation was attempted to an unreachable network.' (may retry after sleeping)
E1121 17:00:25.863233 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:25.964231 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:26.065082 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:26.165977 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:26.266949 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:26.367541 6156 kubelet.go:2236] node "36274k8s9010" not found |
@PatrickLang would you have any idea on this problem? |
@sylus I'm not from Azure, but still would like to hear more details...
|
I requested a private patch from Azure to address the issue. Basically the VM was blue screening. There is supposed to be a public patch in an update in late November.
|
Hey @daschott thanks so much!
The node is still up but I wasn't able to connect even with RDP; I had to use the serial console to log in and resolve the issues.
The node is visible under kubectl get nodes, but its status is NotReady.
I fixed the issue; when it happens again I'll post here.
For 5 days we were running a .NET Framework app comprising about 8 micro-services. |
I did notice that Datacenter-Core-1809-with-Containers-smalldisk is now on the list of published images and is the equivalent build of Windows 2019. Thinking this would have the fix? Only issue currently against that SKU is: Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m default-scheduler Successfully assigned default/ndmdotnet-admintool-5fb787d6f6-zws7l to 16822k8s9010
Warning FailedCreatePodSandBox 3s (x24 over 5m) kubelet, 16822k8s9010 Failed create pod sandbox: rpc error: code = Unknown desc = failed pulling image "kubletwin/pause": Error response from daemon: repository kubletwin/pause not found: does not exist or no pull access |
Looks like 2019 is not ready yet; I got the same error today, see Azure/acs-engine#4299. |
Can you please use azure CNI v1.0.16 and confirm whether this issue is resolved? Also, for node stability issues for WS2019, there is a hotfix that should come out on 01/29 as KB4476976. |
Hello everyone, I ran Docker CE version 18.09.1 stable on Windows 10 Pro version 1809 and activated Swarm mode. I can create a service, but I cannot connect to an overlay network.
docker service create -d --name web --endpoint-mode dnsrr --publish mode=host,target=80,published=80 microsoft/iis
works, but if I try to connect to an overlay network it fails:
docker service create -d --name web --endpoint-mode dnsrr --publish mode=host,target=80,published=80 --network my-net microsoft/iis
The error is: Hns failed with error : an adapter was not found
Here are the networks:
NETWORK ID     NAME             DRIVER        SCOPE
d1d1547a5d5e   Standardswitch   ics           local
b3ad81976542   host             transparent   local
b5i0zkys9nsj   ingress          overlay       swarm
5d0212aab409   nat              nat           local
ydkbf4vfnkqh   my-net           overlay       swarm
039bdb7cc276   none             null          local |
This issue is about k8s with azure cni, I think it is not related to your case. |
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I've not seen this issue in quite a while. Later builds of AKS-Engine seem to have fixed it. |
Is this a request for help?:
Yes
Is this an ISSUE or FEATURE REQUEST? (choose one):
Issue
What version of acs-engine?:
22.1
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes
What happened:
I have a cluster w/ k8s 10.8.1 running Windows v1803 nodes. All was well for 6 days with a dozen of my pods deployed, then suddenly I started seeing the error below when deploying new versions of a pod:
I noticed these errors only when k8s tries to schedule new pods on a specific node. I have 2 Windows nodes; one is fine, the other exhibits this problem.
Since this seems network related I looked at the ipconfig /all results on each node and do not see a pattern. Here is the output in case it's useful.
First node exhibiting the problem:
Node that is working fine:
Second node exhibiting the problem:
What you expected to happen:
For the pods to be created on any Windows node in the cluster.
How to reproduce it (as minimally and precisely as possible):
I have a cluster that exhibits this now and can easily repro it. But I do not have steps to repro it on a new cluster, it feels random-ish.
Anything else we need to know:
Since I have no idea what causes this or how to fix it my only recourse was to taint the affected node so k8s stops trying to schedule to it. I then scaled my cluster up to get a new node on which I can deploy, and stopped the affected node. Since then the new node worked fine for a few days and then exhibited the same problem. I now have two tainted nodes turned off.
I'm very willing to work with someone that can help me debug this.