NAP is creating NodeClaims and VMs but not registering Nodes to the cluster #248

Closed
WillEdMatsypie opened this issue Apr 3, 2024 · 5 comments
Assignees
Labels
  • area/bootstrap: Issues or PRs related to bootstrap
  • area/provisioning: Issues or PRs related to provisioning (instance provider)
  • area/security: Issues or PRs related to security
  • triage/duplicate: Indicates an issue is a duplicate of other open issue.

Comments

@WillEdMatsypie

Version

Karpenter Overlay Version: N/A - we are using managed NAP

Kubernetes Version: v1.28.5

Expected Behavior

Karpenter should create NodeClaims and provision VMs which then register to the cluster and are capable of scheduling Pods.

Actual Behavior

Pods go into the Pending state and Karpenter creates a NodeClaim for them, e.g.:
image

The NodeClaims then create and provision a VM but never reach the Ready state:
image

The reason given in the Claim is NodeNotFound:
image

We can see the VMs are up and running in the account:
image
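
Since the screenshots above are not reproduced as text, here is a minimal sketch of how the same symptom could be checked from the command line. It assumes the NodeClaim status follows the usual Kubernetes condition layout (`status.conditions` entries with `type`, `reason`, and `message`) and that `kubectl get nodeclaims -o json` is permitted on the cluster. The function names `stuck_nodeclaims` and `fetch_nodeclaims` are hypothetical helpers, not part of Karpenter or NAP.

```python
import json
import subprocess


def stuck_nodeclaims(nodeclaim_list, reason="NodeNotFound"):
    """Return (name, condition type, message) for each condition carrying `reason`.

    `nodeclaim_list` is the parsed JSON of a Kubernetes list object.
    The condition layout is an assumption based on the standard
    Kubernetes status-condition convention.
    """
    hits = []
    for item in nodeclaim_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unknown>")
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("reason") == reason:
                hits.append((name, cond.get("type"), cond.get("message", "")))
    return hits


def fetch_nodeclaims():
    """Read all NodeClaims through kubectl (needs access to the cluster)."""
    result = subprocess.run(
        ["kubectl", "get", "nodeclaims", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Usage against a live cluster (not executed here):
#   for name, cond_type, message in stuck_nodeclaims(fetch_nodeclaims()):
#       print(f"{name}: {cond_type}: {message}")
```

Keeping the filter separate from the `kubectl` call means it can also be run against a saved JSON dump when direct cluster access is not available.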

Steps to Reproduce the Problem

N/A - this is a managed service. In our cluster this was working until yesterday afternoon and then Karpenter stopped being able to register Nodes with the cluster altogether.

Resource Specs and Logs

N/A - the issue is with managed NAP not registering nodes to the cluster; there are no logs we can see beyond those shared under "Actual Behavior" above.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Bryce-Soghigian
Collaborator

I suspect it's related to #247

@Bryce-Soghigian
Collaborator

Will take a look today

@Bryce-Soghigian self-assigned this Apr 3, 2024
@Bryce-Soghigian
Collaborator

Bryce-Soghigian commented Apr 3, 2024

I took a look at your cluster.

At 2024-04-01T00:37:27.778Z a NodeImageUpgrade occurred on the agentpool, which triggered #247, where the bootstrap token isn't rotated when one of the agentpools is upgraded. We can fix this by running az aks update -g rg -n clusterName, which simply reapplies the token and passes that configuration to NAP.

I have mitigated the problem for you by running az aks update -g rg -n clusterName with a powerful Azure employee command.

The fix for this on the NAP side is currently rolling out, but this should fix things for now.
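
For anyone hitting the same symptom before the fix lands, the mitigation above could be scripted roughly as follows. This is only a sketch: it wraps the exact `az aks update -g rg -n clusterName` call quoted in this comment, the helper names `aks_reconcile_command` and `reconcile_cluster` are made up for illustration, and it assumes the Azure CLI is installed and already logged in. Any confirmation prompt the CLI shows is passed through to the terminal.

```python
import subprocess


def aks_reconcile_command(resource_group, cluster_name):
    """Build the argv for `az aks update -g <resource group> -n <cluster name>`."""
    if not resource_group or not cluster_name:
        raise ValueError("resource group and cluster name are both required")
    return ["az", "aks", "update", "-g", resource_group, "-n", cluster_name]


def reconcile_cluster(resource_group, cluster_name, dry_run=True):
    """Print the command (dry run) or execute it with the local Azure CLI."""
    cmd = aks_reconcile_command(resource_group, cluster_name)
    if dry_run:
        print(" ".join(cmd))
    else:
        # Raises CalledProcessError if the CLI exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `reconcile_cluster("rg", "clusterName")` only prints the command; pass `dry_run=False` to actually run it against the cluster.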

@WillEdMatsypie
Author

Awesome, let us know when it is fully rolled out but cheers for the support!

@Bryce-Soghigian
Collaborator

@WillEdMatsypie thank you for opening an issue and making us aware of the problem! Please feel free to open more whenever you run into problems!

@tallaxes added the triage/duplicate, area/bootstrap, area/security, and area/provisioning labels Apr 3, 2024
@tallaxes closed this as not planned Apr 3, 2024

3 participants