[BUG] Missing node-problem-detector and node-exporter on some AKSUbuntu images version #3988

Closed
grzesuav opened this issue Nov 9, 2023 · 12 comments

@grzesuav

grzesuav commented Nov 9, 2023

Describe the bug
The node-problem-detector binary is missing on some AKSUbuntu images. Affected versions:

  • AKSUbuntu-2204gen2containerd-202308.28.0
  • AKSUbuntu-2204gen2containerd-202310.19.2
  • AKSUbuntu-2204gen2containerd-202310.04.0

This degrades visibility into the state of AKS nodes, because no node conditions are set and no health checks are executed.
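
One way to spot affected nodes from the cluster side is to look for nodes that report nothing beyond the usual baseline conditions. The following is a minimal sketch, not an official AKS check: it assumes the input is the JSON produced by `kubectl get nodes -o json`, and it assumes that a node with a working node-problem-detector reports at least one condition type outside the baseline set below. The function names are hypothetical.

```python
import json

# Condition types a node normally reports even without node-problem-detector
# (set by the kubelet or the network plugin). Anything else is assumed,
# for this sketch, to come from node-problem-detector.
BASELINE_CONDITIONS = {
    "Ready",
    "MemoryPressure",
    "DiskPressure",
    "PIDPressure",
    "NetworkUnavailable",
}


def extra_conditions(node: dict) -> list[str]:
    """Return the sorted condition types on a node beyond the baseline set."""
    conditions = node.get("status", {}).get("conditions", [])
    return sorted({c["type"] for c in conditions} - BASELINE_CONDITIONS)


def nodes_without_extra_conditions(node_list_json: str) -> list[str]:
    """Return names of nodes that report only baseline conditions."""
    items = json.loads(node_list_json).get("items", [])
    return [n["metadata"]["name"] for n in items if not extra_conditions(n)]
```

Feeding it the saved output of `kubectl get nodes -o json` returns the nodes worth inspecting by hand; an empty result only means that every node reports some non-baseline condition, not that node-problem-detector is healthy.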

To Reproduce

  • Create a node pool with, e.g., Kubernetes version 1.27, and check whether the image used is one of the above
  • Exec into the node and run systemctl status node-problem-detector
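
The on-node check can also be scripted. This is a rough sketch under a few assumptions: it runs on the node itself, the binary lives at `/usr/local/bin/node-problem-detector` (the path shown later in this thread), and the systemd unit is named `node-problem-detector`. The function names and the three status strings are made up for illustration.

```python
import os
import subprocess

NPD_BINARY = "/usr/local/bin/node-problem-detector"  # path as seen in this thread
NPD_UNIT = "node-problem-detector"                   # systemd unit name from the repro steps


def binary_present(path: str = NPD_BINARY) -> bool:
    """True if the file exists and is marked executable."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def unit_active(unit: str = NPD_UNIT, run=subprocess.run) -> bool:
    """True if `systemctl is-active --quiet <unit>` exits with status 0."""
    try:
        result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    except FileNotFoundError:
        # systemctl itself is not available on this host
        return False
    return result.returncode == 0


def npd_status(path: str = NPD_BINARY, unit: str = NPD_UNIT, run=subprocess.run) -> str:
    """Classify the node as missing-binary, installed-not-running, or running."""
    if not binary_present(path):
        return "missing-binary"
    return "running" if unit_active(unit, run) else "installed-not-running"
```

The `run` parameter exists only so the systemd call can be replaced in tests; on a real node, `npd_status()` with no arguments is enough.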

Expected behavior
We expect the node-problem-detector binary to be installed and running as a systemd service.

Screenshots
If applicable, add screenshots to help explain your problem.

  • Good node (has binary): [screenshots not reproduced]
  • Bad node (missing binary): [screenshots not reproduced]

Environment (please complete the following information):

  • Kubernetes version 1.27.3

Additional context
We are aware of https://security.snyk.io/vuln/SNYK-WOLFILATEST-NODEPROBLEMDETECTOR-5862811 . Was the binary removed because of this CVE?
Why is there no information about node-problem-detector and node-exporter in the release notes at https://github.com/Azure/AKS/tree/master/vhd-notes/aks-ubuntu/AKSUbuntu-2204 ?
Why are not all AKS Ubuntu image versions listed there?

@grzesuav grzesuav added the bug label Nov 9, 2023
@grzesuav
Author

grzesuav commented Nov 9, 2023

When the binaries are present, they report the following versions:

node-exporter --version
node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)
  build user:       root@836ae5c4ca52
  build date:       20221130-00:05:11
  go version:       go1.19.3
  platform:         linux/amd64

and

/usr/local/bin/node-problem-detector --version
v0.8.13

@aritraghosh
Contributor

Please take a look here. You can create a support case if you have issues with your cluster.

@grzesuav
Author

We are now seeing the issue mostly with the 'AKSUbuntu-2204gen2containerd-202310.19.2' Ubuntu image.

@jkroepke

jkroepke commented Nov 16, 2023

Experiencing the same issue with AKSCBLMariner-V2gen2-202310.31.0.

According to the docs, Node Problem Detector should be present: https://learn.microsoft.com/en-us/azure/aks/node-problem-detector

@aritraghosh Case ID 2311160050001448

@JohnRusk
Member

JohnRusk commented Jan 15, 2024

We (AKS) are looking into this.

@jkroepke

It seems that in recent AKS image versions, the image does not exist.

@JohnRusk
Member

@jkroepke Node-Problem-Detector and Node-Exporter are not built into the OS image. They are automatically installed when a node joins an AKS node pool. My team is looking into these reports that they are sometimes not installed automatically.

@grzesuav
Author

grzesuav commented Jan 15, 2024

In recent versions it does indeed seem to be installed automatically, 30-120 min after the node pool is provisioned. One remaining issue: when a single node/VM is in a failed state, the installation is skipped on all nodes in that node pool/VMSS.

@aritraghosh aritraghosh self-assigned this Feb 5, 2024
@c4milo

c4milo commented Feb 8, 2024

I'm testing using Standard_L8s_v3 VMs and it seems the binary installed has the wrong architecture:

Feb 08 22:39:28 aks-default-14936645-vmss000002 node-problem-detector-startup.sh[1998451]: /usr/local/bin/node-problem-detector-startup.sh: 50: /usr/local/bin/node-problem-detector: Exec format error
Feb 08 22:39:28 aks-default-14936645-vmss000002 systemd[1]: node-problem-detector.service: Main process exited, code=exited, status=126/n/a
Feb 08 22:39:28 aks-default-14936645-vmss000002 systemd[1]: node-problem-detector.service: Failed with result 'exit-code'.

root@aks-default-14936645-vmss000002:/# /usr/local/bin/node-problem-detector
bash: /usr/local/bin/node-problem-detector: cannot execute binary file: Exec format error
root@aks-default-14936645-vmss000002:/# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
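
An "Exec format error" usually means the kernel does not recognize the file as something it can execute on this machine, which is consistent with a binary built for a different architecture. A quick way to confirm is to read the `e_machine` field from the ELF header and compare it with the host. The sketch below is illustrative only: the function names are hypothetical, the mapping covers just a few common `e_machine` values, and the host comparison is only meaningful on 64-bit x86 and ARM Linux, where `platform.machine()` returns `x86_64` or `aarch64`.

```python
import platform
import struct

# A few e_machine values from the ELF specification
ELF_MACHINES = {3: "i386", 40: "arm", 62: "x86_64", 183: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Decode the target architecture from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5): 1 = little-endian, 2 = big-endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown({machine})")


def binary_matches_host(path: str) -> bool:
    """True if the binary's ELF architecture equals platform.machine()."""
    with open(path, "rb") as f:
        target = elf_machine(f.read(20))
    return target == platform.machine()
```

On the node, `binary_matches_host("/usr/local/bin/node-problem-detector")` returning `False` would point to an architecture mismatch, while a `ValueError` would suggest the file is not an ELF binary at all (for example, a truncated or failed download).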

@grzesuav
Author

grzesuav commented Feb 8, 2024

Regarding the original issue, I no longer observe it; npd is correctly installed via the Linux extension after 15-30 min.

@grzesuav
Author

grzesuav commented Mar 1, 2024

I will close the issue now. We suspect it was caused by an unattended upgrade of the AKS Ubuntu image, and we are considering switching to Azure Linux.

@grzesuav grzesuav closed this as completed Mar 1, 2024
@bravebeaver

Hi @c4milo,

Sorry for the late update. For this particular issue on ARM64, there was a bug in the node-problem-detector v0.8.14 artefact. It will be fixed in the next release, within the next week or two. No further action is required on your side, and the updated version should be pulled in automatically.
