-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(clusterapi): HasInstance with namespace prefix #6776
base: master
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mweibel The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @mweibel. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
c616556
to
332ac2c
Compare
ping @elmiko @MaxFedotov for authoring/reviewing the original PR #6708 |
@mweibel Thanks, that's my bad - I was looking as an example at autoscaler/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go Lines 335 to 336 in 4f1c8e6
Although there is a test for this function, which passed without errors, there are two problems in it (which I missed completely):
While in Also, So your logic in this case is correct, thanks for finding this out! But to prevent such problems in the future, we need to update tests as well. @elmiko WDYT? I can make a separate issue and update all tests. |
@MaxFedotov thanks! I tried updating the tests by doing the following change: diff --git a/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go b/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go
index fc89fdae5..e374ea447 100644
--- a/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go
+++ b/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go
@@ -458,7 +458,8 @@ func makeLinkedNodeAndMachine(i int, namespace, clusterName string, owner metav1
ObjectMeta: metav1.ObjectMeta{
Name: fmt.Sprintf("%s-%s-node-%d", namespace, owner.Name, i),
Annotations: map[string]string{
- machineAnnotationKey: fmt.Sprintf("%s/%s-%s-machine-%d", namespace, namespace, owner.Name, i),
+ machineAnnotationKey: fmt.Sprintf("%s-%s-machine-%d", namespace, owner.Name, i),
+ clusterNamespaceAnnotationKey: namespace,
},
},
Spec: corev1.NodeSpec{ As a result |
/test pull-cluster-autoscaler-e2e-azure @mweibel I'm maintainining some Azure-infra cluster-autoscaler tests, one scenario of which includes the clusterapi provider (running in Azure via CAPZ), so this test run will get some basic signal against your change |
@MaxFedotov i definitely think it would be cool to make the tests more accurate. although, i don't want to break other stuff. @mweibel thanks for the PR, i'm still understanding the nuances that you and Max are talking about. but i will spend some time today/tomorrow reviewing this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this looks good, just a question about failures
@@ -86,8 +87,9 @@ func (p *provider) NodeGroupForNode(node *corev1.Node) (cloudprovider.NodeGroup, | |||
// HasInstance returns whether a given node has a corresponding instance in this cloud provider | |||
func (p *provider) HasInstance(node *corev1.Node) (bool, error) { | |||
machineID := node.Annotations[machineAnnotationKey] | |||
ns := node.Annotations[clusterNamespaceAnnotationKey] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the annotation is not present, or we fail to get it, what happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case, we'd try to call findMachine
with just the machineID
which would result in not finding it and thus returning the machine not found for node
error.
|
||
machine, err := p.controller.findMachine(machineID) | ||
machine, err := p.controller.findMachine(path.Join(ns, machineID)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused how we could update the machine lookup key in this way and not have that break things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this would work, unless ns
was empty perhaps? but if it ever ran on windows it would break. probably better if we just use a fmt.Sprintf
instead.
looking at the example here, i think the empty string would be handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused how we could update the machine lookup key in this way and not have that break things.
To make sure I understood you correctly: are you worried about future refactorings where we'd change how the machine lookup key is created in the informer?
The question triggered a few questions I had about the informer and how it's set up.
Looking at the code and the debug output I see the following:
- machineInformer
keyFunc
isk8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc
- it additionally adds an indexer by providerID using
indexMachineByProviderID
The keyFunc
uses ObjectNames to construct the namespace/name string.
The lookup therefore can be done either by the providerID
(which we don't have in the corev1.Node object) or by namespace/name
string.
Using path.Join
or fmt.Sprintf
doesn't matter much, probably. If the ns
is not set, the machine won't be found. In the clusterapi codebase I saw both fmt.Sprintf
and path.Join
being used, for slightly different or similar purposes.
I'm happy to change or keep it as-is - please let me know your preference.
btw it would be great if somebody could give this an /ok-to-test. Thanks! |
/ok-to-test |
apologies @mweibel i lost track of this PR a little, wonder if there were any further thoughts from @jackfrancis about the last comment? |
What type of PR is this?
/kind bug
What this PR does / why we need it:
fixes lookup of Machines in MachineInformer store for the
HasInstance
caseWhich issue(s) this PR fixes:
Fixes #6774
Special notes for your reviewer:
It would be good to verify this with other kind of clusters and/or flags (I'm testing on a CAPZ cluster) to make sure the lookup works in all cases.
Does this PR introduce a user-facing change?