fix(clusterapi): HasInstance with namespace prefix #6776

mweibel · 2024-04-29T09:38:55Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

fixes lookup of Machines in MachineInformer store for the HasInstance case

Which issue(s) this PR fixes:

Special notes for your reviewer:

It would be good to verify this with other kinds of clusters and/or flags (I'm testing on a CAPZ cluster) to make sure the lookup works in all cases.

Does this PR introduce a user-facing change?

NONE

k8s-ci-robot · 2024-04-29T09:39:00Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mweibel
Once this PR has been reviewed and has the lgtm label, please assign detiber for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cluster-autoscaler/cloudprovider/clusterapi/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2024-04-29T09:39:04Z

Hi @mweibel. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mweibel · 2024-04-29T09:40:22Z

ping @elmiko @MaxFedotov for authoring/reviewing the original PR #6708

MaxFedotov · 2024-04-29T11:13:23Z

@mweibel Thanks, that's my bad - I was using the findMachineByProviderID function as an example:

autoscaler/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go

Lines 335 to 336 in 4f1c8e6

    machineID := node.Annotations[machineAnnotationKey]
    return c.findMachine(machineID)

Although there is a test for this function, which passed without errors, there are two problems in it (which I missed completely):

In TestControllerFindMachineByProviderID the part where machine is found using MachineAnnotation on Node is skipped:

autoscaler/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go

Lines 742 to 750 in 4f1c8e6

    // Remove all the "machine" annotation values on all the
    // nodes. We want to force findMachineByProviderID() to only
    // be successful by searching on provider ID.
    for _, node := range testConfig.nodes {
    	delete(node.Annotations, machineAnnotationKey)
    	if err := controller.nodeInformer.GetStore().Update(node); err != nil {
    		t.Fatalf("unexpected error updating node, got %v", err)
    	}
    }

When test data is generated, MachineAnnotation on Node is constructed using namespace/machine.Name format:

autoscaler/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go

Line 461 in 4f1c8e6

machineAnnotationKey: fmt.Sprintf("%s/%s-%s-machine-%d", namespace, namespace, owner.Name, i),

While in cluster-api it is just machine.Name:
https://github.com/kubernetes-sigs/cluster-api/blob/b29e26c0210d12fffb1c1d59e5bbcd492242801e/internal/controllers/machine/machine_controller_noderef.go#L109

Also, ClusterNamespaceAnnotation is not set at all.

So your logic in this case is correct, thanks for finding this out!

But to prevent such problems in the future, we need to update tests as well.

@elmiko WDYT? I can make a separate issue and update all tests.
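
A minimal sketch of why the old fixture may have hidden the bug, assuming the informer store key has the form namespace/name (as discussed later in this thread). The helper names `machineName`, `fixtureAnnotation`, and `storeKey`, and the values `ns1` and `md-0`, are hypothetical and only mirror the format strings quoted above; they are not functions from the autoscaler code base.

```go
package main

import "fmt"

// machineName follows the machine naming used by the test fixture quoted above.
func machineName(namespace, owner string, i int) string {
	return fmt.Sprintf("%s-%s-machine-%d", namespace, owner, i)
}

// fixtureAnnotation reproduces the value the old fixture wrote to the node
// annotation: the namespace is already baked in as a prefix.
func fixtureAnnotation(namespace, owner string, i int) string {
	return fmt.Sprintf("%s/%s", namespace, machineName(namespace, owner, i))
}

// storeKey is the assumed "namespace/name" form of the informer store key.
func storeKey(namespace, name string) string {
	return namespace + "/" + name
}

func main() {
	ns, owner := "ns1", "md-0" // hypothetical values
	key := storeKey(ns, machineName(ns, owner, 0))

	// The old fixture value coincides with the store key, so a direct lookup succeeds.
	fmt.Println("fixture annotation matches key:", fixtureAnnotation(ns, owner, 0) == key)

	// The bare machine name (what cluster-api is said to set) does not match it.
	fmt.Println("bare machine name matches key:", machineName(ns, owner, 0) == key)
}
```

Under these assumptions, a lookup that passes the annotation value straight through only works when the test data already carries the namespace prefix.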

mweibel · 2024-04-29T15:23:40Z

@MaxFedotov thanks!

I tried updating the tests by doing the following change:

diff --git a/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go b/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go
index fc89fdae5..e374ea447 100644
--- a/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go
+++ b/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go
@@ -458,7 +458,8 @@ func makeLinkedNodeAndMachine(i int, namespace, clusterName string, owner metav1
                ObjectMeta: metav1.ObjectMeta{
                        Name: fmt.Sprintf("%s-%s-node-%d", namespace, owner.Name, i),
                        Annotations: map[string]string{
-                               machineAnnotationKey: fmt.Sprintf("%s/%s-%s-machine-%d", namespace, namespace, owner.Name, i),
+                               machineAnnotationKey:          fmt.Sprintf("%s-%s-machine-%d", namespace, owner.Name, i),
+                               clusterNamespaceAnnotationKey: namespace,
                        },
                },
                Spec: corev1.NodeSpec{

As a result, TestControllerFindMachineFromNodeAnnotation (lines 1252-1257) fails: it can't find the machine because the machineAnnotationKey value no longer matches what is in Node.Spec.ProviderID. I'm unsure what to change there at the moment, because Azure is already a special case (normalizedProviderString treats VMSS specially) and I'm not sure what other providers do.
I might keep digging a bit more, but if you have an idea or want to do a follow-up, I'm happy with that too :)

jackfrancis · 2024-04-29T18:52:22Z

/test pull-cluster-autoscaler-e2e-azure

@mweibel I'm maintaining some Azure-infra cluster-autoscaler tests, one scenario of which includes the clusterapi provider (running in Azure via CAPZ), so this test run will give some basic signal on your change

elmiko · 2024-04-30T13:10:34Z

@MaxFedotov i definitely think it would be cool to make the tests more accurate. although, i don't want to break other stuff.

@mweibel thanks for the PR, i'm still understanding the nuances that you and Max are talking about. but i will spend some time today/tomorrow reviewing this.

elmiko

i think this looks good, just a question about failures

elmiko · 2024-05-09T13:41:51Z

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_provider.go

@@ -86,8 +87,9 @@ func (p *provider) NodeGroupForNode(node *corev1.Node) (cloudprovider.NodeGroup,
 // HasInstance returns whether a given node has a corresponding instance in this cloud provider
 func (p *provider) HasInstance(node *corev1.Node) (bool, error) {
 	machineID := node.Annotations[machineAnnotationKey]
+	ns := node.Annotations[clusterNamespaceAnnotationKey]


if the annotation is not present, or we fail to get it, what happens?

in this case, we'd try to call findMachine with just the machineID which would result in not finding it and thus returning the machine not found for node error.
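
The behaviour described in this reply can be sketched with a plain map standing in for the machine store. This is only an illustration under stated assumptions: the annotation keys, the `hasInstance` helper, and the error value are hypothetical stand-ins, not the provider's actual constants or implementation.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// Hypothetical stand-ins for machineAnnotationKey and clusterNamespaceAnnotationKey.
	machineAnnotation   = "example.test/machine"
	namespaceAnnotation = "example.test/cluster-namespace"
)

// errMachineNotFound is an assumed error, named after the message mentioned in the thread.
var errMachineNotFound = errors.New("machine not found for node")

// hasInstance looks up a machine in a store keyed by "namespace/name".
// If the namespace annotation is missing, only the bare name is used as the key.
func hasInstance(store map[string]bool, annotations map[string]string) (bool, error) {
	key := annotations[machineAnnotation]
	if ns := annotations[namespaceAnnotation]; ns != "" {
		key = ns + "/" + key
	}
	if !store[key] {
		return false, fmt.Errorf("%w: key %q", errMachineNotFound, key)
	}
	return true, nil
}

func main() {
	store := map[string]bool{"default/machine-0": true}

	withNS := map[string]string{machineAnnotation: "machine-0", namespaceAnnotation: "default"}
	ok, err := hasInstance(store, withNS)
	fmt.Println(ok, err)

	// Without the namespace annotation the key is just "machine-0", which the store does not contain.
	withoutNS := map[string]string{machineAnnotation: "machine-0"}
	ok, err = hasInstance(store, withoutNS)
	fmt.Println(ok, err)
}
```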

jackfrancis · 2024-05-09T18:29:18Z

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_provider.go


-	machine, err := p.controller.findMachine(machineID)
+	machine, err := p.controller.findMachine(path.Join(ns, machineID))


I'm a bit confused how we could update the machine lookup key in this way and not have that break things.

i think this would work, unless ns was empty perhaps? but if it ever ran on windows it would break. probably better if we just use a fmt.Sprintf instead.

looking at the example here, i think the empty string would be handled.

> I'm a bit confused how we could update the machine lookup key in this way and not have that break things.

To make sure I understood you correctly: are you worried about future refactorings where we'd change how the machine lookup key is created in the informer?

The question triggered a few questions I had about the informer and how it's set up.
Looking at the code and the debug output I see the following:

machineInformer keyFunc is k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc

it additionally adds an indexer by providerID using indexMachineByProviderID

The keyFunc uses ObjectNames to construct the namespace/name string.

The lookup therefore can be done either by the providerID (which we don't have in the corev1.Node object) or by namespace/name string.

Using path.Join or fmt.Sprintf doesn't matter much, probably. If the ns is not set, the machine won't be found. In the clusterapi codebase I saw both fmt.Sprintf and path.Join being used, for slightly different or similar purposes.

I'm happy to change or keep it as-is - please let me know your preference.
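
To make the path.Join versus fmt.Sprintf comparison concrete, here is a small sketch using only the Go standard library. The helper names `joinKey` and `sprintfKey` and the sample values are hypothetical. Note that Go's `path` package always uses forward slashes regardless of operating system (the OS-specific variant is `path/filepath`), and that `path.Join` ignores empty elements.

```go
package main

import (
	"fmt"
	"path"
)

// joinKey builds a lookup key with path.Join, as the diff in this PR does.
func joinKey(ns, name string) string {
	return path.Join(ns, name)
}

// sprintfKey builds the key with fmt.Sprintf, the alternative raised in review.
func sprintfKey(ns, name string) string {
	return fmt.Sprintf("%s/%s", ns, name)
}

func main() {
	// Compare both helpers for a set and an empty namespace.
	for _, ns := range []string{"default", ""} {
		fmt.Printf("ns=%q path.Join=%q fmt.Sprintf=%q\n",
			ns, joinKey(ns, "machine-0"), sprintfKey(ns, "machine-0"))
	}
}
```

With a non-empty namespace both helpers produce the same key. With an empty namespace, path.Join yields the bare name while fmt.Sprintf yields a key with a leading slash; neither would match a namespace/name key in the store.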

mweibel · 2024-05-13T08:20:03Z

btw it would be great if somebody could give this an /ok-to-test. Thanks!

elmiko · 2024-05-13T13:05:32Z

/ok-to-test

elmiko · 2024-06-07T17:54:34Z

apologies @mweibel i lost track of this PR a little, wonder if there were any further thoughts from @jackfrancis about the last comment?

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. labels Apr 29, 2024

k8s-ci-robot requested a review from hardikdr April 29, 2024 09:39

k8s-ci-robot added the area/cluster-autoscaler label Apr 29, 2024

k8s-ci-robot requested a review from jackfrancis April 29, 2024 09:39

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/provider/cluster-api Issues or PRs related to Cluster API provider labels Apr 29, 2024

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 29, 2024

fix(clusterapi): HasInstance with namespace prefix

332ac2c

mweibel force-pushed the clusterapi-fix-HasInstance branch from c616556 to 332ac2c Compare April 29, 2024 09:39

k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Apr 29, 2024

elmiko reviewed May 9, 2024

jackfrancis reviewed May 9, 2024

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 13, 2024
