fix(clusterapi): HasInstance with namespace prefix #6776

Merged
merged 1 commit into from
Jul 12, 2024

@mweibel mweibel commented Apr 29, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

fixes lookup of Machines in MachineInformer store for the HasInstance case

Which issue(s) this PR fixes:

Fixes #6774

Special notes for your reviewer:

It would be good to verify this with other kinds of clusters and/or flags (I'm testing on a CAPZ cluster) to make sure the lookup works in all cases.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. labels Apr 29, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/provider/cluster-api Issues or PRs related to Cluster API provider labels Apr 29, 2024
@k8s-ci-robot
Contributor

Hi @mweibel. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 29, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Apr 29, 2024
@mweibel
Contributor Author

mweibel commented Apr 29, 2024

ping @elmiko @MaxFedotov for authoring/reviewing the original PR #6708

@MaxFedotov
Contributor

MaxFedotov commented Apr 29, 2024

@mweibel Thanks, that's my bad - I was using the findMachineByProviderID function as an example:

machineID := node.Annotations[machineAnnotationKey]
return c.findMachine(machineID)

Although there is a test for this function, which passed without errors, there are two problems in it (which I missed completely):

  1. In TestControllerFindMachineByProviderID, the part where the machine is found using the MachineAnnotation on the Node is skipped:

    // Remove all the "machine" annotation values on all the
    // nodes. We want to force findMachineByProviderID() to only
    // be successful by searching on provider ID.
    for _, node := range testConfig.nodes {
        delete(node.Annotations, machineAnnotationKey)
        if err := controller.nodeInformer.GetStore().Update(node); err != nil {
            t.Fatalf("unexpected error updating node, got %v", err)
        }
    }

  2. When test data is generated, the MachineAnnotation on the Node is constructed using the namespace/machine.Name format:

    machineAnnotationKey: fmt.Sprintf("%s/%s-%s-machine-%d", namespace, namespace, owner.Name, i),

While in cluster-api it is just machine.Name:
https://github.com/kubernetes-sigs/cluster-api/blob/b29e26c0210d12fffb1c1d59e5bbcd492242801e/internal/controllers/machine/machine_controller_noderef.go#L109

Also, ClusterNamespaceAnnotation is not set at all.

So your logic in this case is correct, thanks for finding this out!
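To make the mismatch described above concrete, here is a minimal sketch, not the actual autoscaler code. It assumes the Machine informer store is keyed by "namespace/name" (the usual client-go convention) and that the Node carries the bare machine name plus a separate cluster-namespace annotation. The annotation key strings and the helper machineStoreKey are illustrative assumptions, and a plain map stands in for the informer store.

```go
package main

import "fmt"

// Annotation keys on the Node. The string values are assumptions for this
// sketch; the real keys are defined by Cluster API and the autoscaler.
const (
	machineAnnotationKey          = "cluster.x-k8s.io/machine"
	clusterNamespaceAnnotationKey = "cluster.x-k8s.io/cluster-namespace"
)

// machineStoreKey (hypothetical helper) builds the "namespace/name" key
// from the two Node annotations. It reports false if either is missing.
func machineStoreKey(annotations map[string]string) (string, bool) {
	name, ok := annotations[machineAnnotationKey]
	if !ok || name == "" {
		return "", false
	}
	namespace, ok := annotations[clusterNamespaceAnnotationKey]
	if !ok || namespace == "" {
		return "", false
	}
	return namespace + "/" + name, true
}

func main() {
	// A plain map stands in for the Machine informer store.
	store := map[string]string{"default/worker-0": "Machine default/worker-0"}

	annotations := map[string]string{
		machineAnnotationKey:          "worker-0",
		clusterNamespaceAnnotationKey: "default",
	}

	// Looking up by the bare machine name misses the namespaced key.
	_, foundBare := store[annotations[machineAnnotationKey]]
	fmt.Println("lookup by bare name found:", foundBare)

	// Looking up by "namespace/name" hits it.
	if key, ok := machineStoreKey(annotations); ok {
		_, foundQualified := store[key]
		fmt.Println("lookup by", key, "found:", foundQualified)
	}
}
```

Test data that writes "namespace/name" directly into the machine annotation would make the bare lookup succeed, which is how a test can pass while the real lookup fails.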

But to prevent such problems in the future, we need to update tests as well.

@elmiko WDYT? I can make a separate issue and update all tests.

@mweibel
Contributor Author

mweibel commented Apr 29, 2024

@MaxFedotov thanks!

I tried updating the tests by doing the following change:

diff --git a/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go b/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go
index fc89fdae5..e374ea447 100644
--- a/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go
+++ b/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller_test.go
@@ -458,7 +458,8 @@ func makeLinkedNodeAndMachine(i int, namespace, clusterName string, owner metav1
                ObjectMeta: metav1.ObjectMeta{
                        Name: fmt.Sprintf("%s-%s-node-%d", namespace, owner.Name, i),
                        Annotations: map[string]string{
-                               machineAnnotationKey: fmt.Sprintf("%s/%s-%s-machine-%d", namespace, namespace, owner.Name, i),
+                               machineAnnotationKey:          fmt.Sprintf("%s-%s-machine-%d", namespace, owner.Name, i),
+                               clusterNamespaceAnnotationKey: namespace,
                        },
                },
                Spec: corev1.NodeSpec{

As a result, TestControllerFindMachineFromNodeAnnotation fails on lines 1252-1257: it can't find the machine because the machineAnnotationKey value doesn't match what is in Node.Spec.ProviderID. I'm currently unsure what to change there, because Azure is already a special case (normalizedProviderString treats VMSS specially) and I'm not sure what other providers do.
Might keep digging a bit more but in case you have an idea or want to do a follow-up, I'm happy too :)

@jackfrancis
Contributor

/test pull-cluster-autoscaler-e2e-azure

@mweibel I'm maintaining some Azure-infra cluster-autoscaler tests, one scenario of which includes the clusterapi provider (running in Azure via CAPZ), so this test run will give some basic signal on your change.

@elmiko
Contributor

elmiko commented Apr 30, 2024

@MaxFedotov i definitely think it would be cool to make the tests more accurate. although, i don't want to break other stuff.

@mweibel thanks for the PR, i'm still understanding the nuances that you and Max are talking about. but i will spend some time today/tomorrow reviewing this.

Contributor

@elmiko elmiko left a comment

i think this looks good, just a question about failures

@mweibel
Contributor Author

mweibel commented May 13, 2024

btw it would be great if somebody could give this an /ok-to-test. Thanks!

@elmiko
Contributor

elmiko commented May 13, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 13, 2024
@elmiko
Contributor

elmiko commented Jun 7, 2024

apologies @mweibel i lost track of this PR a little, wonder if there were any further thoughts from @jackfrancis about the last comment?

@mweibel
Contributor Author

mweibel commented Jun 25, 2024

bumping this hoping to get a reply from @jackfrancis regarding what @elmiko asked :)

@jackfrancis
Contributor

code lgtm

What's the latest on the viability of updating the tests as per the above conversation between @mweibel and @MaxFedotov?

@mweibel mweibel force-pushed the clusterapi-fix-HasInstance branch from 332ac2c to 98f9489 Compare July 1, 2024 11:38
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 1, 2024
@mweibel
Contributor Author

mweibel commented Jul 1, 2024

@jackfrancis thanks for the reminder!

I had another look at the tests this morning and actually found a missing update to a .findMachine() call, thanks to updating the tests according to what @MaxFedotov mentioned! Nice catch there 👍

Would be great to get another review.

Given that I overlooked one call to findMachine, I wonder if we should adjust the function's signature to accept a namespace/name or something similar to avoid potential issues in the future.
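One possible shape for that idea, as a rough sketch only: a small typed key makes callers supply both parts, so a bare machine name can't be passed by accident. The machineKey type, the machineStore map, and this findMachine signature are hypothetical and not taken from the autoscaler code; a plain map again stands in for the informer store.

```go
package main

import "fmt"

// machineKey is a hypothetical typed key that carries both parts needed
// to address a Machine in a store keyed by "namespace/name".
type machineKey struct {
	Namespace string
	Name      string
}

// String renders the key in "namespace/name" form.
func (k machineKey) String() string {
	return k.Namespace + "/" + k.Name
}

// machineStore is a stand-in for the Machine informer store.
type machineStore map[string]string

// findMachine takes a machineKey instead of a raw string, so the compiler
// rejects callers that only have a bare machine name at hand.
func (s machineStore) findMachine(key machineKey) (string, bool) {
	machine, ok := s[key.String()]
	return machine, ok
}

func main() {
	store := machineStore{"default/worker-0": "Machine default/worker-0"}

	machine, ok := store.findMachine(machineKey{Namespace: "default", Name: "worker-0"})
	fmt.Println(machine, ok)
}
```

The trade-off is that every call site has to be touched once, but a forgotten namespace then shows up at compile time or as an obviously empty field rather than as a silent lookup miss.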

@elmiko
Contributor

elmiko commented Jul 9, 2024

@mweibel i'll take a look this week, thanks for the update!

Contributor

@elmiko elmiko left a comment

this looks good to me, thanks for keeping it alive.

i'm adding lgtm for now to give a last chance for reviews, if nothing comes up i will approve by end of week (unless someone beats me to it XD).

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 10, 2024
@elmiko
Contributor

elmiko commented Jul 12, 2024

it seems like there are no further requests or objections, so i am going to approve this. thanks everyone!

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, mweibel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 12, 2024
@k8s-ci-robot k8s-ci-robot merged commit 1997b5f into kubernetes:master Jul 12, 2024
6 checks passed
@mweibel
Contributor Author

mweibel commented Jul 29, 2024

@elmiko thanks for merging! can you take care of including this in the next patch release, too? 🙇

@elmiko
Contributor

elmiko commented Jul 29, 2024

@mweibel yes, i will try to make sure that it's in the next patch release.

Successfully merging this pull request may close these issues.

clusterapi: HasInstance function tries to find machine with wrong key
5 participants