This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Kubernetes 1.11 cluster node lost its InternalIP #3503

Closed
jackfrancis opened this issue Jul 18, 2018 · 10 comments

Comments

@jackfrancis
Member

See:

azureuser@k8s-master-34767551-0:~$ kubectl describe node k8s-agentmd-34767551-1 | grep 'Addresses:' -A 2
Addresses:
  InternalIP:  10.239.0.128
  Hostname:    k8s-agentmd-34767551-1
azureuser@k8s-master-34767551-0:~$ kubectl describe node k8s-agentmd-34767551-2 | grep 'Addresses:' -A 2
Addresses:
  Hostname:  k8s-agentmd-34767551-2
Capacity:
@jackfrancis
Member Author

k8s-agentmd-34767551-1 above is an example of what we expect to see, i.e., it has an InternalIP value. k8s-agentmd-34767551-2 is missing an InternalIP.

This is from a v1.11.0 cluster with 3 masters and 2 agent pools, each with 3 nodes.
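
A quick way to spot affected nodes across the whole cluster (not from the original report, just a jsonpath sketch) is to print each node next to whatever InternalIP it currently advertises; nodes hit by this show an empty second column:

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'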

@UnwashedMeme

This seems similar to #2388. I don't think it's related to Azure-CNI; I'm seeing the same thing with Cilium as the CNI plugin.

Another piece I'm seeing: the kubelet on the node that still has its InternalIP logs Using Node Hostname from cloudprovider: "k8s-main-40037667-vmss000000" at regular intervals, while on the node missing its InternalIP that log message stopped appearing.

$ kubectl describe node  | grep -A 2 Addresses
Addresses:
  InternalIP:  172.20.24.4
  Hostname:    k8s-main-40037667-vmss000000
--
Addresses:
  Hostname:  k8s-main-40037667-vmss000001
Capacity:
--
Addresses:
  InternalIP:  172.20.24.240
  Hostname:    k8s-master-40037667-0

## Healthy node, the datestamp is now
$ ssh 172.20.24.4 journalctl -u kubelet | grep Hostname | tail -n 1
Aug 20 16:11:26 k8s-main-40037667-vmss000000 kubelet[1646]: I0820 16:11:26.275734    1646 kubelet_node_status.go:546] Using Node Hostname from cloudprovider: "k8s-main-40037667-vmss000000"

## Unhealthy node, the datestamp is 11hrs ago.
$ ssh 172.20.24.5 journalctl -u kubelet | grep Hostname | tail -n 1
Aug 20 05:16:18 k8s-main-40037667-vmss000001 kubelet[1907]: I0820 05:16:18.225480    1907 kubelet_node_status.go:546] Using Node Hostname from cloudprovider: "k8s-main-40037667-vmss000001"

It normally repeats roughly every 10 seconds.
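
For reference, that cadence is consistent with the kubelet's default node status update interval (nodeStatusUpdateFrequency, 10s), which is presumably when the address/hostname information gets refreshed. A rough way to check what interval a node is actually running with (whether it is set via flag or config file varies by how the cluster was provisioned, so treat this as a sketch):

$ ssh 172.20.24.4 'pgrep -af kubelet | tr " " "\n" | grep node-status-update-frequency'
$ ssh 172.20.24.4 'grep -i nodeStatusUpdateFrequency /var/lib/kubelet/config.yaml 2>/dev/null'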

Meanwhile cilium is giving me a log message

$ kubectl -n kube-system logs --timestamps=true cilium-n79rs  | grep "Ignoring invalid node IP"| head -n 1
2018-08-20T05:15:01.0733129Z level=warning msg="Ignoring invalid node IP" ipAddr= k8sNodeID=5229570d-9ab4-11e8-b5ae-000d3a012abe nodeName=k8s-main-40037667-vmss000001 subsys=k8s type=InternalIP

From their code, this happens when the InternalIP is nil.

It looks to me like the cloud provider is, at some point, failing to find and correctly report the host's InternalIP. One oddity: the cilium messages about receiving bad updates start about 80s earlier than the last kubelet update. Maybe both are reacting to something else?
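
One way to narrow that down (not verified in this thread, and it assumes the cluster runs the Azure cloud provider with useInstanceMetadata enabled) is to query the Azure instance metadata service directly from the affected node and check whether the private IP is still reported at the source:

$ ssh 172.20.24.5 'curl -s -H Metadata:true "http://169.254.169.254/metadata/instance?api-version=2017-08-01"'

If IMDS still returns the private IP while the Node object has lost it, the failure is more likely in the cloud provider / kubelet status update path than in the metadata source itself.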

@stieler-it

We are probably experiencing a similar issue: missing InternalIP and ExternalIP entries for 2 of 5 nodes in an environment using the OpenStack cloud provider.

>kubectl describe node intern-master1
...
Addresses:
  Hostname:  intern-master1

>kubectl describe node intern-worker1
...
Addresses:
  InternalIP:  10.0.0.15
  ExternalIP:  (hidden)
  Hostname:    intern-worker1

It worked a week ago, so the addresses were lost somewhere along the way.

@jackfrancis
Member Author

@feiskyer @andyzhangx Just curious: is this a symptom we're tracking in some fashion? Have you heard about it? Thanks!

@jackfrancis
Member Author

Also @weinong notes that these symptoms have been periodically observed in AKS. We will look for patterns.

@feiskyer
Member

feiskyer commented Sep 5, 2018

Just curious: is this a symptom we're tracking in some fashion? Have you heard about it? Thanks!

No, I hadn't noticed this issue before. Let me check what might be wrong.

@mitchellmaler

@stieler-it After messing around with OpenStack I think I found a workaround. If the config-drive feature is enabled in your OpenStack cluster, you can provision the nodes with it enabled and set up the cloud-config to search only the config drive. This bypasses the metadata API lookup, which is what is failing. I am still letting my cluster run for a few days, but so far the InternalIP hasn't gone missing.
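
For concreteness, the workaround above corresponds to the [Metadata] search-order setting in the OpenStack cloud provider's cloud config. A minimal sketch (the file path and the rest of the config depend on your deployment; the Metadata section is the only part relevant here):

$ cat /etc/kubernetes/cloud-config
[Global]
# ... existing OpenStack auth settings ...
[Metadata]
# Query only the config drive and skip the (failing) metadata service lookup
search-order = configDrive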

@weinong
Contributor

weinong commented Oct 25, 2018

@feiskyer
Member

feiskyer commented Dec 3, 2018

The fix is being cherry-picked to the 1.11 release in kubernetes/kubernetes#70400. It is still pending.

@stale

stale bot commented Mar 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.

@stale stale bot added the stale label Mar 9, 2019
@stale stale bot closed this as completed Mar 16, 2019