Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Resources can get stuck deleting if they were never actually created due to invalid name #2607

Open
matthchr opened this issue Nov 22, 2022 · 7 comments
Labels
bug 🪲 Something isn't working low-priority Low priority item. We'll get to it eventually.

Comments

@matthchr
Copy link
Member

Version of Azure Service Operator
All versions of v2

Describe the bug
Resources which fail to create because their name is invalid can get stuck deleting. This is because the operator always issues a DELETE HTTP request for the AzureName of the resource and if that delete fails does not proceed with deletion of the resource in Kubernetes. This behavior (if delete fails don't delete in k8s) is desired as it prevents users from accidentally leaking resources in their Azure subscription without realizing it.

In the case that the name is bad, the DELETE fails with an error saying the name is bad. For example:

controllers/controller/resourcegroup "msg"="Reconciler error" "error"="deleting resource \"/subscriptions/{sub}/resourceGroups/map[create:true name:]\": DELETE https://management.azure.com/subscriptions/{sub}/resourceGroups/map[create:true name:]
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
ERROR CODE: InvalidResourceGroup
--------------------------------------------------------------------------------

or

Status:
  Conditions:
    Last Transition Time:  2022-11-07T18:46:54Z
    Message:               deleting resource "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/dev-rg/providers/Microsoft.KeyVault/vaults/kvname/providers/Microsoft.Authorization/roleAssignments/kv-role-assignement3": DELETE https://management.azure.com/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/dev-rg/providers/Microsoft.KeyVault/vaults/kvname/providers/Microsoft.Authorization/roleAssignments/kv-role-assignement3
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
ERROR CODE: InvalidRoleAssignmentId
--------------------------------------------------------------------------------
{
  "error": {
    "code": "InvalidRoleAssignmentId",
    "message": "The role assignment ID 'kv-role-assignement3' is not valid. The role assignment ID must be a GUID."
  }
}
--------------------------------------------------------------------------------

Workaround
Annotate the resource stuck deleting with reconcile-policy: skip, which will allow the delete to proceed.

Additional context
This has been mentioned in a number of user-filed issues:
#2478
#2586
#1273

@matthchr matthchr added the bug 🪲 Something isn't working label Nov 22, 2022
@matthchr
Copy link
Member Author

There are patterns we could implement to make this experience better without compromising the fault tolerance guarantees, something like:

  1. Send request to Azure
  2. If (after polling any requisite long-running operations) the creation for sure failed (4xx status code), set an annotation not-successfully-created).
  3. When deleting, if not-successfully-created annotation is set, let delete proceed without ever issuing request to Azure.

There's a good bit of complexity in making that pattern work for all cases/resource-types though, so we haven't done it so far given the simple workaround.

@matthchr
Copy link
Member Author

Another stopgap solution is to do a better job rejecting invalid names at the k8s APIServer level. This would make the problem rarer although it could still happen. Making name validation better wouldn't resolve the cases like the permissions one mentioned in #1273 though.

@theunrepentantgeek
Copy link
Member

What about doing a GET for the resource and checking for a 404 response?
If Azure disavows any knowledge of the resource, maybe it's OK for us to remove the Kubernetes Resource as well?

@matthchr
Copy link
Member Author

matthchr commented Nov 23, 2022

I believe that GET will also 400 BadRequest because the specified name is invalid, and if we can distinguish "this name is invalid" from an error code perspective we don't need to GET we can just PUT and say "if we get the NotValidName code count that as a success and let k8s resource delete"

@matthchr matthchr added this to the v2.0.0-beta.5 milestone Nov 29, 2022
@matthchr matthchr added this to Backlog in Azure Service Operator Roadmap via automation Nov 29, 2022
@matthchr
Copy link
Member Author

This is still something we're thinking about, but there hasn't been a lot of user clamor about it and there is a workaround so its not a high priority right now.

@theunrepentantgeek theunrepentantgeek modified the milestones: v2.3.0, v2.2.0 May 31, 2023
@matthchr
Copy link
Member Author

  • Let's look at resources and see if there are many (any?) with resource names that don't have validation on it.
  • Let's document in our FAQ the way out of this situation if it does happen. Specifically that you can change the name of a resource if it hasn't successfully been created.

@theunrepentantgeek
Copy link
Member

Let's look at resources and see if there are many (any?) with resource names that don't have validation on it.

Using VS Code and doing a regex search for instances of AzureName preceeded by +kubebuilder:validation comments, I found 102 matches.

Regex:

(\s*// \+kubebuilder:validation.*\n)+(\s*//.*\n)*\s*AzureName\s+string

Doing a similar search, but just for AzureName as a struct field, I found 466 maches.

Regex:

AzureName\s*string

This implies the majority of AzureName properties lack validation.

Manually checking the search results, here are the first ten distinct cases:

  • authorization/RoleAssignment_Spec
  • cache/Redis_FirewallRule_Spec
  • cache/Redis_LinkedServer_Spec
  • cache/Redis_Spec
  • cache/RedisEnterprise_Database_Spec
  • cache/RedisEnterprise_Spec
  • cdn/Profiles_Endpoint_Spec
  • cdn/Profile_Spec
  • compute/Snapshot_Spec
  • compute/Disk_Spec

@matthchr matthchr modified the milestones: v2.2.0, v2.3.0 Jul 6, 2023
@theunrepentantgeek theunrepentantgeek modified the milestones: v2.5.0, v2.4.0 Sep 7, 2023
@matthchr matthchr modified the milestones: v2.4.0, v2.5.0 Oct 23, 2023
@theunrepentantgeek theunrepentantgeek modified the milestones: v2.6.0, v2.5.0 Nov 14, 2023
@theunrepentantgeek theunrepentantgeek modified the milestones: v2.5.0, v2.6.0, v2.7.0 Dec 11, 2023
@matthchr matthchr added the low-priority Low priority item. We'll get to it eventually. label Feb 22, 2024
@matthchr matthchr removed this from the v2.6.0 milestone Feb 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🪲 Something isn't working low-priority Low priority item. We'll get to it eventually.
Projects
Development

No branches or pull requests

2 participants