Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dockerd lockup - AKS 1.11.5+1.11.7, kernel 4.15, with Moby 3.0.1+3.0.5, #838

Closed
jnoller opened this issue Feb 19, 2019 · 5 comments
Closed

Comments

@jnoller
Copy link
Contributor

jnoller commented Feb 19, 2019

Known issue tracking bug.

User(s) have reported an intermittent issue where the dockerd daemon (Moby 3.0.1+3.0.5) will enter a hard lock, uninterruptible state rendering worker nodes unrecoverable until a forced reboot. This is due to a bug in the linux kernel. Links to the moby and ubuntu bugs is below

Please see the detailed bug reports on the Moby and Ubuntu repos

moby/moby#38750
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802021

Thanks to alanjcastonguay for the support ticket and report that helped us chase the issue down.

@ellieayla
Copy link

The fix for https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802021 is shipped in linux-azure package version 4.15.0.1040.44. AKS nodes have unattended-upgrade enabled, so should download and install the new package. Switching to the new kernel version requires a reboot, and the node won't be rebooted automatically. Reboot via portal.azure.com or az aks upgrade or az vm restart or az vm run-command invoke --command-id RunShellScript --scripts "reboot &"

  • apt-cache policy linux-azure will list currently-installed & proposed versions of the package. If Installed: 4.15.0.1039.43 + Candidate: 4.15.0.1040.44, wait another day or run /usr/bin/unattended-upgrade now to install the candidate.

  • uname -r and kubectl get nodes -o wide will report the currently-running kernel. If they're showing an older kernel than apt-cache policy linux-azure reports installed, reboot the node.

@ellieayla
Copy link

No outages observed in the last 6 days running 4.15.0-1040-azure. The longest window between outages prior was 4 days. I want at least 2 weeks of uptime before closing this ticket as fixed.

@ellieayla
Copy link

I have observed 2 weeks of uptime on 8 nodes without observation of the original symptoms since upgrading the AKS node kernel to 4.15.0-1040-azure. I am confident the kernel patch has resolved our problem.

@jnoller: This GitHub issue can be closed.

@jnoller
Copy link
Contributor Author

jnoller commented Mar 21, 2019

@jnoller jnoller closed this as completed Mar 21, 2019
@ellieayla
Copy link

4.15.0-1040-azure+ is apparently not the default for new nodes following a scale-up event in (some subset of) AKS clusters. A new node added to an existing cluster on 2019-04-05 got linux kernel 4.15.0-1037-azure, complete with reproduction of this bug. Newer kernel versions were downloaded/installed, but the node was never rebooted (manually or otherwise). 10 days of uptime => boom.

I will be adding more https://github.com/weaveworks/kured

@ghost ghost locked as resolved and limited conversation to collaborators Jul 25, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants