Skip to content

Acknowledge all ASO resource types before creating #5571

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

nojnhuh
Copy link
Contributor

@nojnhuh nojnhuh commented Apr 17, 2025

What type of PR is this?
/kind bug

What this PR does / why we need it:

This PR fixes a potential issue where if CAPZ creates an ASO resource defined by an ASOAPI resource, then crashes or otherwise fails to record that resource in the PATCH to the CAPZ resource status at the end of the reconciliation loop, a quickly following delete of the CAPZ resource (or removing that ASO resource from spec) will not know to also delete that ASO resource if it's not recorded in status.

This change makes CAPZ record each ASO resource group/version/kind in a sigs.k8s.io/cluster-api-provider-azure-owned-aso-kinds annotation and send those changes to the API server before continuing to actually create resources of that type. This way, no matter when a CAPZ ASO API resource is deleted, CAPZ can rely on any ASO resource type that needs to be deleted to be in that annotation.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

I initially split this into a couple commits in case that's easier to review: one for the functional change and one to refactor things a bit.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

ASOAPI: Fixed a possible bug that could leave ASO resources dangling when they should be deleted.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 17, 2025
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Apr 17, 2025

/hold for squash

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 17, 2025
@nojnhuh nojnhuh moved this from Todo to Needs Review in CAPZ Planning Apr 17, 2025
Copy link

codecov bot commented Apr 17, 2025

Codecov Report

Attention: Patch coverage is 85.63536% with 26 lines in your changes missing coverage. Please review.

Project coverage is 53.40%. Comparing base (bc33778) to head (725ec6d).
Report is 115 commits behind head on main.

Files with missing lines Patch % Lines
controllers/resource_reconciler.go 84.24% 17 Missing and 9 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5571      +/-   ##
==========================================
+ Coverage   52.87%   53.40%   +0.53%     
==========================================
  Files         272      272              
  Lines       29473    29602     +129     
==========================================
+ Hits        15583    15810     +227     
+ Misses      13083    12978     -105     
- Partials      807      814       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nawazkh nawazkh self-assigned this Apr 18, 2025
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Apr 23, 2025

@Jont828 points out that status should generally not be informing any reconcile logic. I now remember seeing that written down somewhere too, but can't find it in the API conventions or kubebuilder book. If you have a link that would be great to add here for posterity. I'll see if I can rework this PR.

@nawazkh
Copy link
Member

nawazkh commented Apr 23, 2025

The status summarizes the current state of the object in the system, and is usually persisted with the object by an automated processes but may be generated on the fly. At some cost and perhaps some temporary degradation in behavior, the status could be reconstructed by observation if it were lost

The PUT and POST verbs on objects MUST ignore the "status" values, to avoid accidentally overwriting the status in read-modify-write scenarios.

Reference 1 for the above text: https://book-v1.book.kubebuilder.io/basics/status_subresource
Reference 2 that originates from Ref 1. https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status

Although, the above links are from older generations of kubebuilder, the underlying convention of not relying on Status of a resource to perform a reconcile logic still is widely used.

Acknowledge all ASO resource types before creating
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Apr 28, 2025

I've updated this to avoid using status from previous reconciliations as an input to anything. Main changes:

  • Orphaned resources are identified as resources not in spec that have an owner reference to this CAPZ resource.
  • The types of resources that might need to be checked are stored in an annotation to avoid having to list every resource type advertised by the API server.

/retitle Acknowledge all ASO resource types before creating

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 28, 2025
@k8s-ci-robot k8s-ci-robot changed the title Acknowledge all ASO resources before creating Acknowledge all ASO resource types before creating Apr 28, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from nawazkh. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nawazkh
Copy link
Member

nawazkh commented Apr 30, 2025

/retest

2 similar comments
@willie-yao
Copy link
Contributor

/retest

@nawazkh
Copy link
Member

nawazkh commented May 2, 2025

/retest

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented May 2, 2025

@nojnhuh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-azure-apidiff 725ec6d link false /test pull-cluster-api-provider-azure-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nawazkh
Copy link
Member

nawazkh commented May 2, 2025

As of this comment, the apidiff job is expected to fail

+ WRAPPED_COMMAND_PID=18
+ wait 18
+ ./scripts/ci-apidiff.sh
*** Running go-apidiff ***
GOBIN=/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/hack/tools/bin ./scripts/go_install.sh github.com/joelanford/go-apidiff go-apidiff v0.8.2
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/hack/tools/bin/go-apidiff 49f9ded3ce052ca01a9721342cb685c842e1570d
 sigs.k8s.io/cluster-api-provider-azure/api/v1alpha1
  Incompatible changes:
  - (*AzureASOManagedCluster).GetResourceStatuses: removed
  - (*AzureASOManagedControlPlane).GetResourceStatuses: removed
  - (*AzureASOManagedMachinePool).GetResourceStatuses: removed
 sigs.k8s.io/cluster-api-provider-azure/controllers
  Incompatible changes:
  - (*ResourceReconciler).Reconcile: changed from func(context.Context) error to func(context.Context) (bool, error)
+ EXIT_VALUE=1
+ set +o xtrace

@nawazkh
Copy link
Member

nawazkh commented May 8, 2025

Still reviewing this PR, little too dense to skim through.

Copy link
Member

@nawazkh nawazkh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @nojnhuh ! This is a cool PR, took some time to understand it. Sorry for the delay there.

I have some questions and suggestions listed below.

@@ -39,6 +42,8 @@ import (
"sigs.k8s.io/cluster-api-provider-azure/util/tele"
)

const ownedKindsAnnotation = "sigs.k8s.io/cluster-api-provider-azure-owned-aso-kinds"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should move this annotation to /azure/const.go unless there is a specific reason to keep it here ?
We also should add explanatory comment on the significance of using this annotation.

for _, spec := range r.owner.GetResourceStatuses() {
newStatus, err := r.deleteResource(ctx, spec.Resource)
ownedKindsValue := r.owner.GetAnnotations()[ownedKindsAnnotation]
ownedKinds, err := parseOwnedKinds(ownedKindsValue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It be great if we can add a supporting unit test for parseOwnedKinds()

return false, fmt.Errorf("failed to parse %s annotation: %s", ownedKindsAnnotation, ownedKindsValue)
}

ownedObjs, err := r.ownedObjs(ctx, ownedKinds)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same suggestion as above, would love to see a unit test validating r.ownedObjs().

Comment on lines +306 to +311

// partitionResources splits the sets of resources in spec and the current set
// of owned, existing resources into three groups:
// - unobservedTypeResources are of a type not yet known to this owning CAPZ resource.
// - observedTypeResources are of a type already known to this owning CAPZ resource.
// - deletedResources exist but are not defined in spec.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we update deletedResources to toBeDeletedResources since the resources saved in this struct will be issued a delete later on ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- observedTypeResources are of a type already known to this owning CAPZ resource.

I might be missing the significance of the word "Type" here ?
When you say observedTypeResources does "TypeResource" signify something other than a typed entity in Go ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update below resources instead ?

  • unobservedTypeResources to unrecordedResources
  • observedTypeResources to recordedResources

}

for _, spec := range observedTypeResources {
spec.SetNamespace(r.owner.GetNamespace())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this update, spec.SetNamespace(), result in validation-update-error due to immutability of Namespace on an ASO resource ?

Comment on lines +326 to +330
if ownedKinds.Has(typeMeta) {
observedTypeResources = append(observedTypeResources, spec)
} else {
unobservedTypeResources = append(unobservedTypeResources, spec)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ownedKinds also gets updated with the TypeMeta of the deletedResources, will the user be able to recreate a deleted resource with a new name ?

Comment on lines +129 to +130
unobservedTypeResources, observedTypeResources, deletedResources := partitionResources(ownedKinds, r.resources, ownedObjs)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we incorporate the below suggestion for better readability ?

Suggested change
unobservedTypeResources, observedTypeResources, deletedResources := partitionResources(ownedKinds, r.resources, ownedObjs)
unrecordedResources, recordedResources, toBeDeletedResources := partitionResources(ownedKinds, r.resources, ownedObjs)


gvk := spec.GroupVersionKind()
log.V(4).Info("applying resource", "resource", klog.KObj(spec), "resourceVersion", gvk.GroupVersion(), "resourceKind", gvk.Kind)
err := r.Patch(ctx, spec, client.Apply, client.FieldOwner("capz-manager"), client.ForceOwnership)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, how is the controller's GUID set upon creation ? Is client.ForceOwnership enforcing the controller's GUID ?

r.owner.SetResourceStatuses(newResourceStatuses)

return nil
return len(unobservedTypeResources) > 0, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we only going to trigger a requeue if len(unobservedTypeResources) > 0 ? Should we be triggering a requeue nonetheless ?

}
}

func parseOwnedKinds(value string) (sets.Set[metav1.TypeMeta], error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably have a unit test for this function validating a scenario with an ownerKind with multiple / in it.

@willie-yao
Copy link
Contributor

@nojnhuh Let me know if this is ready for review again!

@nawazkh
Copy link
Member

nawazkh commented May 30, 2025

Un-assigning myself from this PR for now since I won't be able to review it again. :)
/unassign

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 10, 2025
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Needs Review
Development

Successfully merging this pull request may close these issues.

4 participants