As you might imagine, if a cron restart would fix it, it would already be in place ;)
DigitalOcean's LoadBalancer doesn't support IPv6, so there is a Droplet with an IPv6 address acting as one. It forwards traffic on ports 80/443 to the Kubernetes cluster, just like the DO LB would. But for this it needs to know the internal IPs of all the Kubernetes nodes. Sadly, these change every time you upgrade a node (as the old one is killed and a new one is created).
You can make Kubernetes call some API whenever a new node appears, and this might be a solution, but it requires some nifty scripting on the IPv6 Droplet. So far I haven't found a clean way that doesn't involve me writing a custom API handler :D
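For illustration (this is not what runs on the Droplet, just a sketch of what such a hook could look like), a small watcher built on the official Python kubernetes client: it prints the set of node InternalIPs whenever the node list changes, which is the moment the Droplet would need to rebuild its 80/443 forwarding rules. The client, the kubeconfig and the print placeholder are all assumptions, not the actual setup.

```python
# Hypothetical sketch: watch the cluster for node changes and print the
# current set of InternalIP addresses, so a script on the IPv6 Droplet
# could rewrite its port 80/443 forwarding rules. Assumes the official
# `kubernetes` Python client and a kubeconfig that can reach the cluster.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

def internal_ips():
    ips = []
    for node in v1.list_node().items:
        for addr in node.status.addresses:
            if addr.type == "InternalIP":
                ips.append(addr.address)
    return sorted(ips)

known = internal_ips()
print("initial nodes:", known)

w = watch.Watch()
for event in w.stream(v1.list_node):
    current = internal_ips()
    if current != known:
        known = current
        # Here the Droplet would regenerate its forwarding rules;
        # printing is just a placeholder for that step.
        print("node set changed:", known)
```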
Updates are rare and far apart, but possibly I could get something in there that informs me whenever it loses its connection to a Kubernetes node, and emails me or something. Will look into that.
The problem still isn't really resolved. One of the two nodes is failing 50% of the healthchecks, and the logs give no indication why. IPv4 traffic is now routed via the working node. IPv6 is still a bit touch and go.
Going to disable IPv6 for a bit to get some further information on the issue. Should be back within 10 minutes or so.
Edit: and it is enabled again; still with degraded performance
Further investigation shows that it is something deep in Kubernetes. We were running on two nodes; now on three. The third node shows exactly the same issue as the second: kube-proxy is dropping connections randomly. The first, however, is working just fine.
I tried various things, but nothing seems to change the situation.
I have now degraded both LoadBalancers (IPv4 and IPv6) to only use the first node for kube-proxy for now. This seems to be working and stable. I reached out to DigitalOcean to see if they know what is going on.
If I cannot find a solution, I will upgrade kubernetes to 1.14 next week, in the hope that fixes it.
To be continued!
More updates, as updates are fun:
Turns out the CNI (flannel) lost its way. One node has a subnet the others don't know about, and the others have one that nobody knows about. So any traffic that needs to go from any other node to the first one gets lost. Luckily the first node does have a full view of everything, which is also why the traffic currently arrives where it should.
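In case anyone wants to check their own cluster for this kind of drift: a rough diagnostic sketch (not the exact commands I used) that lists each node's podCIDR next to flannel's node annotations with the official Python kubernetes client, so a subnet nobody agrees on stands out. The annotation prefix is flannel's usual one; treat the whole thing as an illustration.

```python
# Print each node's podCIDR alongside flannel's node annotations so a
# subnet mismatch between nodes is easy to spot. Assumes the official
# `kubernetes` Python client; the annotation keys are flannel's usual
# ones and may differ per setup.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    annotations = node.metadata.annotations or {}
    print(node.metadata.name)
    print("  spec.podCIDR:", node.spec.pod_cidr)
    for key, value in annotations.items():
        if key.startswith("flannel.alpha.coreos.com/"):
            print(" ", key + ":", value)
```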
I updated my ticket with DigitalOcean. I am going to give them some time to look into this too, as it is a problem with the managed service they deliver (and not my/our mistake, basically). Hopefully they know how to resolve this cleanly; otherwise I will be rebuilding the cluster (which only takes ~15 minutes, so that is okay).
At least I now understand the issue. Just no clue how/what caused it. Hopefully DO can answer that.
Problem is resolved. Final recap:
We run Kubernetes on 2 nodes. On the 25th of July, at around 03:30 UTC, the host on which one of the nodes runs had hardware issues, and DigitalOcean wanted to migrate all Droplets away from it. This is pretty common, and what I expect of a managed service. During this migration, the node crashed and rebooted itself. Kubernetes has no problem losing nodes, and recovers nearly instantly.
During the reboot, flannel assigned a new IP range to this node (most likely because a file on disk got corrupted). But Docker on the node was still configured for the old range. As a result, services created on that node were bound to the wrong IP. Even reboots didn't fix this.
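For the curious, a minimal sketch of the check that would have caught this, run on the node itself. It assumes the common flannel/Docker integration where flannel writes /run/flannel/subnet.env and Docker's bridge is docker0; both are assumptions about this particular setup rather than verified facts.

```python
# Compare the subnet flannel hands out with the address Docker actually
# put on its bridge. A mismatch means containers get IPs outside the
# range the rest of the cluster expects for this node.
import ipaddress
import subprocess

def flannel_subnet(path="/run/flannel/subnet.env"):
    # subnet.env contains lines like FLANNEL_SUBNET=10.244.1.1/24
    with open(path) as f:
        for line in f:
            if line.startswith("FLANNEL_SUBNET="):
                return line.split("=", 1)[1].strip()
    return None

def docker_bridge_addr(iface="docker0"):
    # `ip -4 -o addr show docker0` prints one line, e.g.
    # "4: docker0    inet 10.244.1.1/24 brd ..."
    out = subprocess.run(["ip", "-4", "-o", "addr", "show", iface],
                         capture_output=True, text=True).stdout
    parts = out.split()
    return parts[3] if len(parts) > 3 else None

flannel = flannel_subnet()
docker = docker_bridge_addr()
print("flannel subnet:", flannel)
print("docker0 addr:  ", docker)
if flannel and docker:
    if ipaddress.ip_interface(flannel).network != ipaddress.ip_interface(docker).network:
        print("MISMATCH: Docker bridge is not in flannel's subnet")
    else:
        print("OK: Docker bridge matches flannel's subnet")
```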
Initially, the other node was rebooted, as that was the one showing issues. As it turns out, all connections that came in via the first node were routed just fine: it knew its own IP and could route to the IP of the other node. Connections in the other direction, however, did not work. As a result, the healthy node appeared unhealthy, and vice versa.
After a long tracing session (so happy most docker images have a
As a temporary mitigation, a third node had already been added to the cluster. This heavily reduced the number of errors, as the odds of hitting a stale connection went from 75% to 22% (I won't bore you with the details of why an additional node caused this). Later we simply told the LoadBalancers to ignore everything but the first (unhealthy) node. At least that one could always route the traffic correctly.
As a permanent solution, the unhealthy node was terminated. Kubernetes recovers from this easily, and all traffic is moving again as it should. The LoadBalancers now route their traffic via the other two nodes. Those two nodes do know how to reach each other, and everything is working as expected again.
Any other outages or problems, let me know!