Validate uniqueness of `prefix` field in `KongVault` resources #5395
Copied from related PR: webhook checks that rely on resources other than the single resource being evaluated are dicey. Checking against the current Kong state may be less bad than trying to check other resources available in the Kubernetes API, but it's still imperfect. We can at least fetch a single prefix from the Kong API, since it is an API identifier: https://docs.konghq.com/gateway/3.3.x/admin-api/#retrieve-vault
Neither solution is perfect, and both may report an incorrect/inconsistent status. But I have a different opinion on which way to choose. What makes checking against the Kong API "less bad" than the other option? @rainest
AFAIK the two variants result in functionally equivalent race conditions. While either approach avoids duplicates when there's sufficient time between resource creations, neither avoids duplicates when both are created at roughly the same time. Cache inserts necessarily precede store inserts and config rendering, so I think the admin API approach actually exacerbates the problem: there's an additional 5s/tick period delay before the webhook will be able to detect a duplicate. Its advantage is more that we can check only the resource that matters, since we don't maintain a prefix index on the cache (we could add one, though, and it's probably better to do so than listing and checking every KongVault).

Startup behavior is worth considering as well: a fresh controller start essentially creates all existing resources at the same time from the controller cache perspective, and furthermore does not involve the webhook. Failure policy can prevent resource creation during downtime, but we don't expose this per-resource hook, and we default it to Ignore because of the hooks we create for Secrets and other non-custom resources.
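The prefix index on the cache suggested above could be a small guarded map. A minimal Go sketch, where all names and the owner-key format (`namespace/name` strings) are assumptions for illustration, not the controller's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// prefixIndex is a hypothetical in-memory index mapping a KongVault
// prefix to the resource ("namespace/name") that currently owns it.
type prefixIndex struct {
	mu    sync.Mutex
	byPfx map[string]string // prefix -> "namespace/name"
}

func newPrefixIndex() *prefixIndex {
	return &prefixIndex{byPfx: map[string]string{}}
}

// Claim records a prefix for owner and reports a conflict if another
// resource already holds it. Re-claiming by the same owner succeeds,
// so repeated reconciles of the same resource are idempotent.
func (i *prefixIndex) Claim(prefix, owner string) error {
	i.mu.Lock()
	defer i.mu.Unlock()
	if cur, ok := i.byPfx[prefix]; ok && cur != owner {
		return fmt.Errorf("prefix %q already used by %s", prefix, cur)
	}
	i.byPfx[prefix] = owner
	return nil
}

func main() {
	idx := newPrefixIndex()
	fmt.Println(idx.Claim("env-vault", "default/vault-a")) // <nil>
	fmt.Println(idx.Claim("env-vault", "default/vault-b")) // non-nil conflict error
}
```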
As an alternative, can we check at store insert time and add a conflict condition, and fail/requeue the reconcile? Rejecting and retrying in the controller reconcile isn't a universally available option, but I think it's possible here:
If we add an index on the prefix, the store can detect the conflict at insert time and the reconcile can fail/requeue.

The main limitation here is that there's no guarantee which resource will win if both are present when the controller starts. If you create a conflicting resource, ignore the failure conditions on it, and leave it in place, you may find that your configuration changes after a restart.

I'm actually not sure why we aren't doing this for other resources. It makes sense for KongConsumer usernames and custom IDs also. Given how simple it is, I have a nagging feeling we did consider it and found some reason to reject it, but I can't recall when/what that was if so.

Question to the controller-runtime community on whether there's some way to handle post-restart state rebuilds well: https://kubernetes.slack.com/archives/C02MRBMN00Z/p1704934745286119

They note that you can use creationTime to decide precedence. Although we've been over the problems with using this to judge the "correct" resource in the past, it does provide a reasonable enough tiebreaker that'd allow consistent order through restarts. That would lead to an edge case where modifying an existing older resource could clobber configuration, but consistency across restarts feels more valuable than avoiding that.

IMO it's reasonable to make the prefix field immutable to avoid this. That field shouldn't change often (changing it would break all existing loose references in configuration using the vault), and needing to delete/recreate the CR if you typo it initially feels acceptable given the other benefits.
That sounds like an interesting idea. I'm wondering what we do in the following case (all mentioned vaults are duplicates):
What would be the desired behavior for the question mark? I understand that we'd like to put the tiebreaker logic into the store. Would that mean it would remove the newer Vault B and add the older Vault A (based on their creation timestamps), returning no error? What about propagating the conflict condition to Vault A in this case? I imagine we could trigger the reconciliation of Vault A by pushing an event via a channel, and there we'd get an error from the store which would allow us to populate the condition properly. 🤔

Anyway, it's effectively a very similar effect to what we'd get by breaking ties in the translator itself; we just shift where we keep the logic doing this. Maybe we could have a separate spike issue to verify if adding such logic to the store and controller would help us with other resources, just to not block the
I have created 2 issues to track these:
As a much simpler alternative: there is a constraint on the allowed values, though I'm not good at reading Lua patterns and am not quite sure how to express it as a regex instead. https://gitspartv.github.io/lua-patterns/?pattern=%5B%5E%5Ba-z%5D%5Ba-z%25d-%5D-%5Ba-z%25d%5D%2B%24%5D sorta helps.
I'm guessing this is approximately the hostname segment regex used for namespaces and names.
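For illustration, here is one best-effort translation of the linked Lua pattern into a Go (RE2) regex: start with a lowercase letter, allow lowercase letters, digits, and hyphens in the middle, and end with a letter or digit. Treating this as equivalent to Kong's actual constraint is an assumption:

```go
package main

import (
	"fmt"
	"regexp"
)

// prefixRe is a possible reading of the Lua pattern ^[a-z][a-z%d-]-[a-z%d]+$:
// Lua's %d becomes [0-9], and the lazy quantifier `-` behaves like `*` for a
// fully anchored match. Minimum length under this reading is two characters.
var prefixRe = regexp.MustCompile(`^[a-z][a-z0-9-]*[a-z0-9]$`)

func main() {
	for _, p := range []string{"env-vault", "a1", "-bad", "bad-", "UPPER"} {
		fmt.Printf("%q -> %v\n", p, prefixRe.MatchString(p))
	}
}
```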
The initial store add attempt fails due to a conflict. If the attempted resource (A) wins the tiebreaker, the error handler deletes the existing conflicting resource (B) from the store, adds the proposed resource (A), and returns a successful reconcile for (A), yeah. The error handler logic would be split: one half handling tiebreaker winners, updating the store, and returning success; the other half adding a conflict condition to the reconciled resource and returning failure to requeue it.

I think the above's second question meant propagating the condition to B, since it's the one being evicted. I'd failed to consider that originally, but you're right: we need to mark it conflicted and start retrying it. The tiebreaker-won half of the error handler would need to create a reconcile event via a channel source, and the following reconcile would add the condition, since B will lose the tiebreaker.
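The split error handler described above could look roughly like this sketch. Every type and name here is hypothetical and the store is reduced to a plain map; the point is only the winner-evicts/loser-requeues shape:

```go
package main

import "fmt"

// vault and store are stand-ins for the real KongVault object and cache store.
type vault struct {
	Name       string
	Created    int64 // stand-in for metav1.Time
	Conflicted bool  // stand-in for a status condition
}

type store struct{ byPrefix map[string]*vault }

// handleConflict resolves a store insert conflict for a shared prefix.
// It assumes an existing conflicting entry is present. The tiebreaker
// winner (older resource) replaces the loser and returns nil; the loser
// is marked conflicted and returns an error so the reconcile requeues.
func (s *store) handleConflict(prefix string, proposed *vault) error {
	existing := s.byPrefix[prefix]
	if proposed.Created < existing.Created { // proposed wins the tiebreaker
		s.byPrefix[prefix] = proposed
		existing.Conflicted = true
		// a channel-source event would requeue `existing` here so its
		// next reconcile can set the conflict condition in its status
		return nil
	}
	proposed.Conflicted = true
	return fmt.Errorf("prefix %q already owned by %s", prefix, existing.Name)
}

func main() {
	s := &store{byPrefix: map[string]*vault{
		"env": {Name: "vault-b", Created: 200},
	}}
	older := &vault{Name: "vault-a", Created: 100}
	fmt.Println(s.handleConflict("env", older)) // <nil>: older resource evicts vault-b
	fmt.Println(s.byPrefix["env"].Name)         // vault-a
}
```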
The main benefit is that we do things "correctly" as far as controller-runtime is concerned. The translator can't do this as-is because the reconcile is already complete; the translator cannot return an error to the controller-runtime event processor and thus can't trigger a requeue or indicate a problem in the affected resource's status.

Theoretically you could maybe rig up something to share error metadata from the translator, trigger channel events from the translator, and have the reconciler evict/update status of resources that have a translator error until their version increments, but it seems backwards and overcomplicated to do so if you can handle it entirely within the reconciler. This could maybe be a way to achieve eviction of resources causing lockups without redesigning the entire reconciler/store/translator/Kong error handler though 🤔 (famous last words).

Other user-provided values that populate unique fields could also use the simpler all-reconciler approach, but I figured it'd make sense to build one and expand. Generating prefixes from namespace+name is the much simpler way to unblock this.
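The namespace+name approach mentioned at the end could be as simple as the following sketch; the separator and format are assumptions, not a decided scheme:

```go
package main

import "fmt"

// defaultPrefix sketches deriving a vault prefix from namespace and name,
// which is unique by construction because namespace/name pairs are unique
// in Kubernetes. The hyphen separator is an illustrative choice only.
func defaultPrefix(namespace, name string) string {
	return fmt.Sprintf("%s-%s", namespace, name)
}

func main() {
	fmt.Println(defaultPrefix("default", "my-vault")) // default-my-vault
}
```

One caveat with this exact scheme: hyphens are also legal inside namespaces and names, so `team-a`/`vault` and `team`/`a-vault` would collide; a real implementation would likely want an unambiguous encoding and a length guard.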
Sounds like a reasonable solution for uniqueness.
Is there an existing issue for this?
Problem Statement
Split from #5333
The prefix of a Kong vault is unique, but we do not have a CEL rule to check the uniqueness of the `prefix` field in `KongVault` resources. To avoid duplicate prefixes generating invalid Kong configuration and causing the Kong gateway to get stuck, we should check the uniqueness of the `prefix` of `KongVault` in webhooks.

Proposed Solution

List all `KongVault`s and reject the update if the prefix of the changed `KongVault` appears in other instances.

Additional information
No response
Acceptance Criteria
`KongVault` with a duplicated `prefix` will be rejected by the webhook