K8s Windows 1803 pod creation fails with "HNS failed with error : Element not found" #853
Comments
@jhowardmsft @madhanrm @dineshgovindasamy Guys, could it be related to HNS along with docker 1.13 as well? |
Please collect the logs using https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/debug/collectlogs.ps1 I suspect this to be the out of sync issue - ie Azure CNI going out of sync with HNS. |
OK, ran the diagnostics and gathered azure-vnet.log on one of the affected nodes. They are attached. |
BTW, on the sig-windows slack channel someone else has seen the same problem. He said the following:
I haven't tried that as I don't know how to reset the HNS network. |
@sharmasushant can you add code in Azure CNI, to check if the network exists, before creating an endpoint. If not, can you re-create the network. On host restart or Kubelet restart kubeletstart.ps1 would attempt to cleanup the HNS Network and remove the json files. If the json files stick around, then azure cni might think that the network is available, and might try creating an endpoint using it, and ending up with this issue. |
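The out-of-sync failure mode described above can be sketched as follows. This is a minimal Python simulation, not the actual Azure CNI code (which is written in Go); `FakeHNS`, `create_endpoint_with_check`, and the state layout are invented for illustration. It models a plugin whose persisted JSON state claims a network exists, while HNS has lost it, and shows the proposed fix: check for the network (and re-create it) before creating an endpoint.

```python
class FakeHNS:
    """Stand-in for the Host Networking Service (illustrative only)."""
    def __init__(self):
        self.networks = {}

    def create_network(self, name):
        self.networks[name] = {"endpoints": []}

    def create_endpoint(self, network, endpoint):
        if network not in self.networks:
            # This is the "Element not found" failure mode from the issue title.
            raise LookupError("HNS failed with error : Element not found")
        self.networks[network]["endpoints"].append(endpoint)

def create_endpoint_with_check(hns, state, endpoint):
    """Proposed fix: verify the network exists in HNS before using it."""
    network = state["network"]        # what the stale persisted JSON claims exists
    if network not in hns.networks:   # out of sync: HNS no longer has the network
        hns.create_network(network)   # re-create it before adding the endpoint
    hns.create_endpoint(network, endpoint)

hns = FakeHNS()                       # fresh HNS: no networks at all
state = {"network": "azure"}          # contents of a stale azure-vnet.json
# Without the existence check, create_endpoint would raise LookupError here.
create_endpoint_with_check(hns, state, "pod-a")
print(hns.networks["azure"]["endpoints"])  # ['pod-a']
```

The trade-off discussed below (an extra lookup on every endpoint creation) corresponds to the `network not in hns.networks` check on the hot path.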
I don't see the error described in this issue in azure-vnet.log. From the diag logs:
Name  Type      ID                                    AddressPrefix
ext   L2Bridge  24a89481-38bb-4bf2-b6c0-52fdd0019708 |
@madhanrm @saiyan86 can you please take up this change? Let's check for network existence while creating an endpoint. @madhanrm please note that this adds overhead to endpoint creation, which we would like to avoid. |
If needed I can repro the problem on this node by removing the taint. Separately, is there a way I can manually fix this problem so I can repair my cluster? |
Try stopping the kubelet service, check whether any azure-vnet.json files exist, delete them if so, and then start the kubelet service. |
@sharmasushant Going forward with RS5 & using the v2 HNS RPC apis, it is mandatory to Open the network (to make sure it exists) & Create endpoint on that network. Just making sure the network exists is not going to cost much for v1 apis |
@madhanrm what's the correct way to stop the kubelet? using "sc stop kubelet" results in the error:
|
Stop-Service kubelet -Force
<del azure-vnet.json file, if exist>
Start-Service kubelet |
OK, that let me stop and start the kubelet. But after doing that I am now seeing this error when I try to schedule a pod:
Also, I do not see a new azure-vnet.json being created in c:\k. But I do see an azure-vnet.json.lock. Here is azure-vnet.log: |
Can you try deleting the 2 files below while kubelet is stopped? <del azure-vnet.json file, if exist> |
Please delete the lock files as well. |
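The manual fix above (stop kubelet, delete the azure-vnet state and lock files from C:\k, start kubelet) can be partially automated. This is a hypothetical helper, not a tool from the thread; it covers only the file-deletion step and assumes the stale files match azure-vnet*.json and azure-vnet*.lock, as described in the comments above.

```python
from pathlib import Path

def clean_stale_cni_state(k_dir):
    """Delete stale Azure CNI state/lock files; return the deleted names."""
    removed = []
    for pattern in ("azure-vnet*.json", "azure-vnet*.lock"):
        for f in Path(k_dir).glob(pattern):
            f.unlink()                # remove stale state so the CNI re-syncs with HNS
            removed.append(f.name)
    return sorted(removed)

# Demo against a throwaway directory standing in for C:\k
import tempfile
demo = Path(tempfile.mkdtemp())
for name in ("azure-vnet.json", "azure-vnet.json.lock", "config.json"):
    (demo / name).touch()
print(clean_stale_cni_state(demo))  # ['azure-vnet.json', 'azure-vnet.json.lock']
```

Run this only while the kubelet service is stopped, as the thread describes; deleting the files while the CNI plugin is active could race with an in-flight write.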
Ahhh, that did the trick nicely. Thank you so much for saving my cluster. :-) Is there anything I have done to trigger this? I'm wondering why others have not seen it. And more importantly, is there anything I can do to avoid this? I know how to fix it now, but it's surely a pain. |
@madhanrm you are a life saver!!! |
My Windows node crashed again just now, after I created a new pod. All pods on that node went into an error state. Some logs here. I am using acs-engine 0.23.1, k8s 1.12.1 |
@zhiweiv can you also provide the kubelet.err.log ? |
I built a new cluster, and the issue occurred again. It is very much like the repro steps in #3920. Thanks |
Hi @zhiweiv,
|
@zhiweiv I have a question. Do you have a resource limit on the containers? |
Thanks @qmalexander , the previous cluster was deleted, I will try the solution if it occurs again. |
@zhiweiv deleting the azure-vnet*[.lock|.json] in C:\k directory didn't help you? Which Azure CNI version is this? |
@daschott From the symptoms I've seen several times before: when there are about 10 or more pods on a node, the issue occurs at unpredictable times. Now I carefully keep fewer than 6 pods on each node, and there has been no new issue since then. I plan to try more than 10 pods on one node with k8s/Windows Server 2019 when they are available. |
So I am getting random Windows node failures and had to follow the steps that @qmalexander gave to recover the node. Does anyone have any idea what is going on? It is starting to get really concerning that a node can just randomly fail and become NotReady. Thinking I should file a critical issue against this for Azure. Steps: https://github.com/Azure/acs-engine/issues/4046#issuecomment-436539771 Small example of my logs when things fail:
belet.go:442: Failed to list *v1.Service: Get https://172.20.205.4:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.20.205.4:443: connectex: A socket operation was attempted to an unreachable network.
E1121 17:00:25.439784 6156 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://172.20.205.4:443/api/v1/pods?fieldSelector=spec.nodeName%3D36274k8s9010&limit=500&resourceVersion=0: dial tcp 172.20.205.4:443: connectex: A socket operation was attempted to an unreachable network.
E1121 17:00:25.460738 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:25.562526 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:25.662861 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:25.763012 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:25.774939 6156 event.go:212] Unable to write event: 'Post https://172.20.205.4:443/api/v1/namespaces/default/events: dial tcp 172.20.205.4:443: connectex: A socket operation was attempted to an unreachable network.' (may retry after sleeping)
E1121 17:00:25.863233 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:25.964231 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:26.065082 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:26.165977 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:26.266949 6156 kubelet.go:2236] node "36274k8s9010" not found
E1121 17:00:26.367541 6156 kubelet.go:2236] node "36274k8s9010" not found |
@PatrickLang would you have any idea on this problem? |
@sylus I'm not from Azure, but still would like to hear more details...
|
I requested a private patch from Azure to address the issue. Basically the VM was blue screening. There is supposed to be a public patch in an update in late November.
|
Hey @daschott thanks so much!
The node is still up but I wasn't able to connect even with RDP; I had to use the serial console to log in and resolve the issues.
The node is visible under kubectl get nodes, but its status is NotReady.
I fixed the issue; when it happens again I'll post here.
For 5 days we were running a .NET Framework app comprising about 8 micro-services. |
I did notice that Datacenter-Core-1809-with-Containers-smalldisk is now on the list of published images and is the equivalent build of Windows 2019. Thinking this would have the fix? Only issue currently against that SKU is: Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m default-scheduler Successfully assigned default/ndmdotnet-admintool-5fb787d6f6-zws7l to 16822k8s9010
Warning FailedCreatePodSandBox 3s (x24 over 5m) kubelet, 16822k8s9010 Failed create pod sandbox: rpc error: code = Unknown desc = failed pulling image "kubletwin/pause": Error response from daemon: repository kubletwin/pause not found: does not exist or no pull access |
Looks like 2019 is not ready yet; I got the same error today, see Azure/acs-engine#4299. |
Can you please use azure CNI v1.0.16 and confirm whether this issue is resolved? Also, for node stability issues for WS2019, there is a hotfix that should come out on 01/29 as KB4476976. |
Hello everyone, I ran Docker CE version 18.09.1 stable on Windows 10 Pro version 1809 and activated Swarm mode. I can create a service, but I cannot connect to an overlay network.
docker service create -d --name web --endpoint-mode dnsrr --publish mode=host,target=80,published=80 microsoft/iis
works, but if I try to connect to an overlay network it fails:
docker service create -d --name web --endpoint-mode dnsrr --publish mode=host,target=80,published=80 --network my-net microsoft/iis
The error is: Hns failed with error : an adapter was not found
Here are the networks:
NETWORK ID     NAME             DRIVER        SCOPE
d1d1547a5d5e   Standardswitch   ics           local
b3ad81976542   host             transparent   local
b5i0zkys9nsj   ingress          overlay       swarm
5d0212aab409   nat              nat           local
ydkbf4vfnkqh   my-net           overlay       swarm
039bdb7cc276   none             null          local |
This issue is about k8s with azure cni, I think it is not related to your case. |
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I've not seen this issue in quite a while. Later builds of AKS-Engine seem to have fixed it. |
Is this a request for help?:
Yes
Is this an ISSUE or FEATURE REQUEST? (choose one):
Issue
What version of acs-engine?:
22.1
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes
What happened:
I have a cluster w/ k8s 10.8.1 running Windows v1803 nodes. All was well for 6 days with a dozen of my pods deployed, then suddenly I started seeing the error below when deploying new versions of a pod:
I noticed these errors only when k8s tries to schedule new pods on a specific node. I have 2 Windows nodes; one is fine, the other exhibits this problem.
Since this seems network related I looked at the ipconfig /all results on each node and do not see a pattern. Here is the output in case it's useful.
First node exhibiting the problem:
Node that is working fine:
Second node exhibiting the problem:
What you expected to happen:
For the pods to be created on any Windows node in the cluster.
How to reproduce it (as minimally and precisely as possible):
I have a cluster that exhibits this now and can easily repro it. But I do not have steps to repro it on a new cluster, it feels random-ish.
Anything else we need to know:
Since I have no idea what causes this or how to fix it my only recourse was to taint the affected node so k8s stops trying to schedule to it. I then scaled my cluster up to get a new node on which I can deploy, and stopped the affected node. Since then the new node worked fine for a few days and then exhibited the same problem. I now have two tainted nodes turned off.
I'm very willing to work with someone that can help me debug this.