This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Kubernetes 1.11 cluster node lost its InternalIP #3503

Closed
jackfrancis opened this issue Jul 18, 2018 · 10 comments

Comments

@jackfrancis
Member

See:

azureuser@k8s-master-34767551-0:~$ kubectl describe node k8s-agentmd-34767551-1 | grep 'Addresses:' -A 2
Addresses:
  InternalIP:  10.239.0.128
  Hostname:    k8s-agentmd-34767551-1
azureuser@k8s-master-34767551-0:~$ kubectl describe node k8s-agentmd-34767551-2 | grep 'Addresses:' -A 2
Addresses:
  Hostname:  k8s-agentmd-34767551-2
Capacity:
@jackfrancis
Member Author

k8s-agentmd-34767551-1 above is an example of what we expect to see, i.e., it has an InternalIP value. k8s-agentmd-34767551-2 is missing an InternalIP.

This is from a v1.11.0 cluster with 3 masters and 2 agent pools, each with 3 nodes.
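
A quick way to spot affected nodes across the whole cluster (not from the original report, just a jsonpath sketch) is to print each node next to whatever InternalIP it currently advertises; nodes hit by this show an empty second column:

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'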

@UnwashedMeme

This seems similar to #2388. I don't think it's related to Azure-CNI; I'm seeing the same thing with Cilium as the CNI plugin.

Another piece I'm seeing: the kubelet on the node that still has its InternalIP logs Using Node Hostname from cloudprovider: "k8s-main-40037667-vmss000000" at regular intervals, while on the node missing its InternalIP that log message stopped appearing.

$ kubectl describe node  | grep -A 2 Addresses
Addresses:
  InternalIP:  172.20.24.4
  Hostname:    k8s-main-40037667-vmss000000
--
Addresses:
  Hostname:  k8s-main-40037667-vmss000001
Capacity:
--
Addresses:
  InternalIP:  172.20.24.240
  Hostname:    k8s-master-40037667-0

## Healthy node, the datestamp is now
$ ssh 172.20.24.4 journalctl -u kubelet | grep Hostname | tail -n 1
Aug 20 16:11:26 k8s-main-40037667-vmss000000 kubelet[1646]: I0820 16:11:26.275734    1646 kubelet_node_status.go:546] Using Node Hostname from cloudprovider: "k8s-main-40037667-vmss000000"

## Unhealthy node, the datestamp is 11hrs ago.
$ ssh 172.20.24.5 journalctl -u kubelet | grep Hostname | tail -n 1
Aug 20 05:16:18 k8s-main-40037667-vmss000001 kubelet[1907]: I0820 05:16:18.225480    1907 kubelet_node_status.go:546] Using Node Hostname from cloudprovider: "k8s-main-40037667-vmss000001"

It normally repeats roughly every 10 seconds.
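
For reference, that cadence is consistent with the kubelet's default node status update interval (nodeStatusUpdateFrequency, 10s), which is presumably when the address/hostname information gets refreshed. A rough way to check what interval a node is actually running with (whether it is set via flag or config file varies by how the cluster was provisioned, so treat this as a sketch):

$ ssh 172.20.24.4 'pgrep -af kubelet | tr " " "\n" | grep node-status-update-frequency'
$ ssh 172.20.24.4 'grep -i nodeStatusUpdateFrequency /var/lib/kubelet/config.yaml 2>/dev/null'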

Meanwhile cilium is giving me a log message

$ kubectl -n kube-system logs --timestamps=true cilium-n79rs  | grep "Ignoring invalid node IP"| head -n 1
2018-08-20T05:15:01.0733129Z level=warning msg="Ignoring invalid node IP" ipAddr= k8sNodeID=5229570d-9ab4-11e8-b5ae-000d3a012abe nodeName=k8s-main-40037667-vmss000001 subsys=k8s type=InternalIP

From their code, this happens when the InternalIP is nil.

It looks to me like the cloud provider is, at some point, failing to find and correctly report the host's InternalIP. One oddity: the cilium messages about receiving bad updates start about 80s earlier than the last kubelet update. Maybe both are reacting to something else?
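
One way to narrow that down (not verified in this thread, and it assumes the cluster runs the Azure cloud provider with useInstanceMetadata enabled) is to query the Azure instance metadata service directly from the affected node and check whether the private IP is still reported at the source:

$ ssh 172.20.24.5 'curl -s -H Metadata:true "http://169.254.169.254/metadata/instance?api-version=2017-08-01"'

If IMDS still returns the private IP while the Node object has lost it, the failure is more likely in the cloud provider / kubelet status update path than in the metadata source itself.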

@stieler-it

We are probably experiencing a similar issue: missing InternalIP and ExternalIP entries for 2 of 5 nodes in an environment using the OpenStack cloud provider.

>kubectl describe node intern-master1
...
Addresses:
  Hostname:  intern-master1

>kubectl describe node intern-worker1
...
Addresses:
  InternalIP:  10.0.0.15
  ExternalIP:  (hidden)
  Hostname:    intern-worker1

It worked a week ago, so the addresses were lost somewhere along the way.

@jackfrancis
Member Author

@feiskyer @andyzhangx Just curious: is this a symptom we're tracking in some fashion? Have you heard about it? Thanks!

@jackfrancis
Member Author

Also @weinong notes that these symptoms have been periodically observed in AKS. We will look for patterns.

@feiskyer
Member

feiskyer commented Sep 5, 2018

Just curious: is this a symptom we're tracking in some fashion? Have you heard about it? Thanks!

No, I hadn't noticed this issue before. Let me check what might be wrong.

@mitchellmaler

@stieler-it After messing around with OpenStack I think I found a workaround. If the config-drive feature is enabled in your OpenStack cluster, you can provision the nodes with it enabled and set up the cloud-config to search only the config drive. This bypasses the metadata API lookup, which is what is failing. I am still letting my cluster run for a few days, but so far the InternalIP hasn't gone missing.
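
For concreteness, the workaround above corresponds to the [Metadata] search-order setting in the OpenStack cloud provider's cloud config. A minimal sketch (the file path and the rest of the config depend on your deployment; the Metadata section is the only part relevant here):

$ cat /etc/kubernetes/cloud-config
[Global]
# ... existing OpenStack auth settings ...
[Metadata]
# Query only the config drive and skip the (failing) metadata service lookup
search-order = configDrive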

@weinong
Contributor

weinong commented Oct 25, 2018

@feiskyer
Member

feiskyer commented Dec 3, 2018

The fix is being cherry-picked to the 1.11 release in kubernetes/kubernetes#70400. It is still pending.

@stale

stale bot commented Mar 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.

@stale stale bot added the stale label Mar 9, 2019
@stale stale bot closed this as completed Mar 16, 2019