
Controller selectors should be immutable after creation #50808

Closed
crimsonfaith91 opened this issue Aug 16, 2017 · 26 comments · Fixed by #50719
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@crimsonfaith91
Contributor

crimsonfaith91 commented Aug 16, 2017

Background
As part of the greater controller GA effort, we removed the defaulting of spec.selector to spec.template.labels in apps/v1beta2, because the defaulting operation is semantically broken. In this issue, we propose making spec.selector immutable after API object creation.

Motivation

  1. Changing selectors leads to undefined behavior - users are not expected to change them
  2. Making selectors immutable ensures they always match the children they create, preventing events such as accidental bulk orphaning
  3. Shipping GA with mutable selectors is an irreversible decision; restricting mutability later would break API backward compatibility

Immutability resolves some challenging issues that have plagued many teams for over a year (e.g., #34292), and the current behavior of controllers with respect to selector mutation is undefined (e.g., #26202).

In contrast, by making selectors immutable, validation ensures the selectors always match the template labels of created children. This provides risk-free, deterministic behavior, resolves the issue, and meets the general goal of declarative configuration.

We could lift the immutability in the future, after all controller types have well-defined behaviors (i.e., after GA). This preserves the ability to address new use cases while maintaining API backward compatibility. However, if we do not make selectors immutable prior to GA, we will have to live with mutable selectors for the lifetime of the v1 API, and with the controllers' undefined behaviors, until reasonable semantics are decided and implemented.
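
For illustration, a minimal sketch of what enforcing this at update time could look like, using the generic apimachinery ValidateImmutableField helper. The wrapper function and wiring here are illustrative only, not the actual change in #50719:

```go
package validation

import (
	apivalidation "k8s.io/apimachinery/pkg/api/validation"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateSelectorImmutable rejects any update whose spec.selector differs
// from the stored object's selector, producing the familiar
// "field is immutable" validation error. Illustrative sketch only.
func validateSelectorImmutable(newSel, oldSel *metav1.LabelSelector) field.ErrorList {
	return apivalidation.ValidateImmutableField(newSel, oldSel,
		field.NewPath("spec").Child("selector"))
}
```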

/kind bug
/sig apps
/sig api-machinery
@bgrant0607 @erictune @kow3ns @enisoc @janetkuo @foxish @liyinan926
@liggitt @smarterclayton @Kargakis @mikedanese

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Aug 16, 2017
@smarterclayton
Contributor

Removing defaulting is a breaking API change? Can you link the issue where we removed it?

@crimsonfaith91
Contributor Author

@smarterclayton The server-side defaulting was removed for apps/v1beta2 only.
The (closed) issue: #50339
The (merged) PR: #50164

@liggitt
Member

liggitt commented Aug 17, 2017

Immutability resolves some challenging issues that have plagued many teams for over a year (e.g., #34292), and the current behavior of controllers with respect to selector mutation is undefined (e.g., #26202).

Both of those issues look like they were related to defaulting, not immutability.

@liggitt
Member

liggitt commented Aug 17, 2017

In contrast, by making selectors immutable, validation ensures the selectors always match the template labels of created children.

For that to work, you would also have to make the template labels immutable, which is another significant change (updating template labels is currently allowed).

However, if we do not make selectors immutable prior to GA, we will have to live with mutable selectors for the lifetime of the v1 API

I don't see a clear way to enforce immutability for the v1 API only, while staying compatible with the existing beta API.

@0xmichalis
Contributor

@mfojtik @tnozicka

@kow3ns
Member

kow3ns commented Aug 17, 2017

@smarterclayton defaulting is removed only in apps/v1beta2, not in apps/v1beta1. As you say, that is a breaking change that we can't make to the existing group version.

@liggitt if we make the selector immutable, and require via validation (as we do now) that the labels match the provided selector for the object to be valid, we get the desired behavior. This is a less strict requirement than making the labels immutable.
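
For reference, a minimal sketch of that existing create-time check. The helper calls are real apimachinery APIs; the wrapper function is invented for the example:

```go
package validation

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// validateSelectorMatchesTemplate is an illustrative wrapper: the pod
// template's labels must satisfy spec.selector for the object to be valid.
func validateSelectorMatchesTemplate(sel *metav1.LabelSelector, templateLabels map[string]string) error {
	selector, err := metav1.LabelSelectorAsSelector(sel)
	if err != nil {
		return fmt.Errorf("invalid label selector: %v", err)
	}
	if !selector.Matches(labels.Set(templateLabels)) {
		// The familiar "`selector` does not match template `labels`" case.
		return fmt.Errorf("`selector` does not match template `labels`")
	}
	return nil
}
```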

Wrt #26202, it contains a long discussion about the problems with selector mutation and links to other relevant issues. @bgrant0607's and @smarterclayton's comments in that thread are still as true now as they were when the issue was opened. Our controllers were not designed to handle selector changes gracefully, and whatever a user is attempting to achieve by mutating the selector, they could (mostly) achieve by other means.

I'm not trying to say that we should never support mutable selectors in the future, but we don't support them gracefully now, and we don't have a design for the exact semantics of selector mutation (and they may not be uniform across all our controllers).

If we make the selectors immutable now, the community has time to offer designs for how selector mutations should work (assuming this feature generates enough interest), and we can make them mutable without breaking backward compatibility.

If we advance to apps/v1 with mutable selectors, we have to support the current undefined semantics for the lifetime of that group version.

To be clear, we want to implement selector immutability in such a way that apps/v1beta1 is not affected, and only API versions newer than it have immutable selectors.

@kow3ns kow3ns closed this as completed Aug 17, 2017
@kow3ns kow3ns reopened this Aug 17, 2017
@kow3ns
Member

kow3ns commented Aug 17, 2017

@smarterclayton issue for defaulting #50339

@crimsonfaith91
Contributor Author

@liggitt @smarterclayton @Kargakis @mikedanese @mfojtik @tnozicka
I came up with a PR that makes the selector immutable without changing the validation code. Feel free to review it to make sure it looks good to you. Thanks!
#50719

@bgrant0607
Member

First of all, I'd like to find a way to make this happen. I think it will be less error-prone for users. I think there are some patterns we will want to allow, such as allowing new !key selector terms in order to facilitate adoption, canaries, and other evolution scenarios, but those can be pushed into the future.

Note that we still intend to make validation versioned eventually: #8550

Also, if we don't want to break old API versions now, we'd still need to allow mutation via old API paths, which means that clients of the new API will potentially observe changes but not be able to make them. I think that's fine, because we should clearly document that observers should be able to deal with selector changes, which will be necessary if we plan to allow mutations again in the future.

@smarterclayton
Contributor

A couple of thoughts.

Because this is so important, because we can't really undo it, and because we will be breaking existing users even if we call it a new API version (it is effectively a change to existing objects), we should at least lay out all the arguments for and against together in one spot (this issue is fine) and be sure we are completely comfortable with the ramifications.

  1. It will be less error prone, but we will be breaking valid use cases (canaries, mutation) at the same time. That suggests that full immutability, while safer for some users, might inconvenience more users.
  2. We haven't really canvassed to determine how many people depend on (1) in production today (deployments are effectively production), so being able to say we aren't impacting anyone is important; if that canvassing has been done, I missed it.
  3. I can think of a few other ways than making this purely immutable - for instance disallowing changes to both in a single update, or allowing both to be changed only if they select the previous set - but those would be slightly confusing. Can we list the cons of those in this issue for posterity (the other issue is too fragmentary now)?
  4. If there is a valid workaround to immutability (delete then create), we should make sure our tools support it sufficiently not to break the workflow (i.e. anyone in 2 needs a way to move forward)
  5. Preventing updates of Deployment selectors doesn't prevent mutation of the labels of replica sets, so we're not completely closing the hole, although modifying replica sets does seem like more of a footgun.
  6. We don't seem to have a clear idea of how we would fix this in the long run, which worries me slightly.

My primary concern is (2): breaking real users. If we break more users than we help, we need to have a good justification (which might be as simple as "selector mutation is catastrophic, this is merely a workflow inconvenience") and to document it here.

@smarterclayton
Contributor

One more

  7. Silently dropping mutations worries me a lot, because that means people who depended on this behavior (2) are now silently broken, which doesn't seem excusable.

@bgrant0607
Member

Another approach, which we've used with other features, is that we could make arbitrary mutation optional, enabled by default in the old API version and disabled by default in the new one.

Spreading updates across multiple updates is not a precedent I want to set for the API.

@liggitt
Member

liggitt commented Aug 24, 2017

@smarterclayton
If there is a valid workaround to immutability (delete then create), we should make sure our tools support it sufficiently not to break the workflow (i.e. anyone in 2 needs a way to move forward)

Requiring delete-then-recreate in order to change labels on the produced child objects can be really disruptive:

  • you lose all the benefits of update strategies and have to manage updating without an outage yourself
  • you have to choose whether to lose or orphan anything referencing the specific instance via ownerrefs

@bgrant0607
Spreading updates across multiple updates is not a precedent I want to set for the API.

agree, that only slightly helps accidental cases, and encourages anti-patterns (update! update again!)

@bgrant0607
Member

bgrant0607 commented Aug 24, 2017

I'm going to partially take back something I just wrote:

My main concern with mutability is that we don't make it easy to accidentally orphan pods. We already have validation that the selector actually matches the pod template, so orphaning can occur when pod template labels and the selector are changed at the same time. We could validate that the new selector matches the old pod template labels as well as the new ones, and, if the pod template labels aren't being changed, that all replicas are "fully labeled" at the current spec generation, which is already reported by the controller in status. Any change to labels would also reset that to zero. If the controller clobbered that, it would also clobber observedGeneration, which would cause the check above to fail on subsequent updates.

That would enable safe selector mutations, such as adding not clauses.

It would also happen to allow 2-part updates, but we don't necessarily have to advertise that. Cascading deletion + recreate (which should be doable automatically with one of the apply brute-force flags) is easier for users without high availability needs.
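
A minimal sketch of the non-orphaning check described above, assuming a hypothetical validateSelectorUpdate helper (the "fully labeled" status check is omitted for brevity):

```go
package validation

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// validateSelectorUpdate is hypothetical: it accepts a new selector only if
// it matches both the old and the new pod template labels, so no existing
// child can be orphaned by the update.
func validateSelectorUpdate(newSel *metav1.LabelSelector, oldTemplateLabels, newTemplateLabels map[string]string) error {
	selector, err := metav1.LabelSelectorAsSelector(newSel)
	if err != nil {
		return fmt.Errorf("invalid label selector: %v", err)
	}
	for _, tmplLabels := range []map[string]string{oldTemplateLabels, newTemplateLabels} {
		if !selector.Matches(labels.Set(tmplLabels)) {
			return fmt.Errorf("updated selector must match both the old and new template labels")
		}
	}
	return nil
}
```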

@bgrant0607
Member

I'm not fond of the idea of gating spec changes on status, though. This violates my "K8s APIs don't go out to lunch" principle.

@xiang90
Contributor

xiang90 commented Aug 24, 2017

My main concern with mutability is that we don't make it easy to accidentally orphan pods. We already have validation that the selector actually matches the pod template, so orphaning can occur when pod template labels and the selector are changed at the same time. We could validate that the new selector matches the old pod template labels as well as the new ones, and, if the pod template labels aren't being changed, that all replicas are "fully labeled" at the current spec generation, which is already reported by the controller in status. Any change to labels would also reset that to zero. If the controller clobbered that, it would also clobber observedGeneration, which would cause the check above to fail on subsequent updates.

How would this alone help with the adding-!key case? We can ensure the new selector still matches the old/new pod template, but the existing pods still have the old labels attached. Would we also need to update the labels on existing pods?

@enisoc
Member

enisoc commented Aug 24, 2017

In case it affects any opinions on this discussion, I think there is a way to get a real validation error (instead of silently dropping the update) and have it only affect the new apps/v1beta2:

#50719 (comment)

@smarterclayton
Contributor

Yes, we would have to initialize the REST storage for the resource differently for apps/v1beta1 and apps/v1beta2.

@kow3ns
Member

kow3ns commented Aug 24, 2017

I think what @enisoc is saying is that we should return an error from ValidateUpdate instead of failing silently in PrepareForUpdate, based on the Context.

@bgrant0607
Member

@xiang90 I have advocated usage of the !newkey pattern when no pre-existing pods were labeled with newkey. This is needed when adding another Deployment that shares all of the same labels as an existing one, but with one or more new, additional labels, such as when adding a canary Deployment. I'm not sure what the use case would be for removing a label from the pod template and adding !key at the same time.
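
For illustration, the stable side of that canary pattern expressed as a selector. The label names here are invented for the example:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Illustrative only: a stable Deployment selects pods that share its
	// labels but do NOT carry the "track" label, so it never adopts pods
	// from a canary Deployment that sets track=canary on its pod template.
	stableSelector := &metav1.LabelSelector{
		MatchLabels: map[string]string{"app": "web"},
		MatchExpressions: []metav1.LabelSelectorRequirement{
			{Key: "track", Operator: metav1.LabelSelectorOpDoesNotExist},
		},
	}
	fmt.Printf("%+v\n", stableSelector)
}
```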

@smarterclayton
Contributor

smarterclayton commented Aug 24, 2017

I think what enisoc is saying is that we should return an error from ValidateUpdate instead of failing silently in PrepareForUpdate, based on the Context.

Yes, we would be able to do that differently based on API version by having a storage that varies based on API version, which is the only way to vary validation today.
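
A sketch of that approach, under the assumption that the apps/v1beta2 REST storage is wired with a stricter update strategy. The strategy shape is simplified and is not the actual registry code:

```go
package deployment

import (
	"context"

	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/util/validation/field"

	apps "k8s.io/kubernetes/pkg/apis/apps"
)

// deploymentStrategy is a simplified stand-in for the real update strategy.
type deploymentStrategy struct{}

// ValidateUpdate returns a real validation error on selector changes instead
// of silently reverting them in PrepareForUpdate. To affect only the newer
// group version, the apps/v1beta2 REST storage would be wired with this
// stricter strategy while apps/v1beta1 keeps the permissive one.
func (deploymentStrategy) ValidateUpdate(ctx context.Context, obj, old runtime.Object) field.ErrorList {
	newDeployment := obj.(*apps.Deployment)
	oldDeployment := old.(*apps.Deployment)

	var allErrs field.ErrorList
	if !apiequality.Semantic.DeepEqual(newDeployment.Spec.Selector, oldDeployment.Spec.Selector) {
		allErrs = append(allErrs, field.Invalid(
			field.NewPath("spec", "selector"),
			newDeployment.Spec.Selector,
			"field is immutable"))
	}
	return allErrs
}
```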

@xiang90
Contributor

xiang90 commented Aug 24, 2017

This is needed when adding another Deployment that shares all of the same labels as an existing one, but with one or more new, additional labels, such as when adding a canary Deployment.

@bgrant0607 thanks, now I understand the use case better.

@kow3ns
Member

kow3ns commented Aug 28, 2017

The extensions/v1beta1 and apps/v1beta1 group versions implement inconsistent semantics for selector mutation across the workloads API surface. Let's leave out batch/v1/Job (which implements something completely different as well).

| Deployment | DaemonSet | StatefulSet |
| --- | --- | --- |
| mutable | mutable | immutable |

For Deployments, we know that we don't handle mutations to the selector and labels gracefully (#24888, #45770), and the documentation strongly cautions against doing so.
Canaries can be supported via tracking labels, and this doesn't actually require selector mutation. If we want to support promotable canaries, per feedback and demand signals from SIG Apps, we should invest in auto-pause, not selector mutation.

For StatefulSet, the orphaning allowed by mutable selectors would likely lead to a condition where the controller deadlocks and leaves the processes of the distributed system it manages in an undefined state. Adopting the semantics of Deployment and DaemonSet would be equally risky, with graver consequences. StatefulSet already has an auto-pause-like feature (partition) for its RollingUpdates, and this allows it to support promotable canaries.

For DaemonSet, the primary use case is to create Pods on a set of selected Nodes. We handle node selector and node label changes gracefully, and you can use these features to canary a Daemon on a set of Nodes before rolling it out to the fleet. We might, in the future, consider promotable canaries via an auto-pause feature. For DaemonSet, there isn't a compelling use case for mutable Pod selectors, and, as with StatefulSets, the consequences of orphaning, given the frequent use of hostPort and hostDir, are more severe than for a Deployment.

For the v1beta2 api surface, if our aim is consistency, we should probably start with a uniform default behavior for selector mutation for the entire surface. There are four broad categories of options available.

  1. Mutability - All of the orphaning issues for Deployment would now be applicable to StatefulSet, and we can't restrict mutability later without introducing a breaking change, but no users that mutate Deployment selectors will be broken.

  2. Immutability - This removes the issue of orphaning, but, as @smarterclayton points out, users of Deployment that have ignored the suggestions in the documentation and found a way to successfully realize their use cases via selector mutation will not be able to consume the API without potentially significant investment. However, if we implement immutability of selectors across the API surface, we can relax the restriction at any point, prior to or after promoting the API to GA.

  3. Restricted Mutability - As an example, if we require that selectors, when updated, match both the current and previous versions of the objects' labels, we can be sure that unintentional orphaning is no longer a concern. If we implement this, we can relax the restriction at any time without breaking backward compatibility, but, if we implement the wrong restriction (either one that is too permissive or one that fails to address users' concerns correctly), we are stuck with it for the lifetime of the API surface.

  4. Opt-In Immutability - We could add an additional flag to the spec, for instance allowOrphans, that overrides selector immutability and allows users to update their selectors and labels in such a way that orphans are possible.

(1) is irrevocable. If we do this, we live with the consequences for the lifetime of the API surface. We know that mutating selectors isn't the suggested way to do anything for Deployments, that it's error prone, and that it's confusing to users.

(3) or (4) would be less error prone, but which, if either, should we implement? Is there a better feature set that we could ship (e.g. auto-pause for promotable canaries and/or orchestrated, non-orphaning selector mutation and Pod relabeling) that would eliminate the need for either?

If we start with (2), we have nine months to a year before the complete removal of extensions/v1beta1 (assuming we can remove it that quickly) in which to assess feedback and demand signals with respect to the right feature set to add to the API surface. If promotable canaries are the primary use case we need to support, we should probably invest in auto-pause over selector mutation. If relabeling is what's required, we should invest in orchestrated, non-orphaning selector mutation and Pod relabeling and solve the problem in the general case. If some version of (3) or (4) is what users really need, we can implement that in a backward-compatible way.

  1. It will be less error prone, but we will be breaking valid use cases (canaries, mutation) at the same time. That suggests that full immutability, while safer for some users, might inconvenience more users.

If users can't consume the API due to selector immutability, then we need to promote features that enable their use cases. If we start with immutability, we can assess requirements and demand, and implement features to better meet the needs of users that currently depend on selector mutation. That may mean implementing restricted or opt-in mutability, or it may mean delivering an orthogonal feature set that meets their needs.

  2. We haven't really canvassed to determine how many people depend on (1) in production today (deployments are effectively production), so being able to say we aren't impacting anyone is important; if that canvassing has been done, I missed it.

extensions/v1beta1 and apps/v1beta1, if deprecated immediately by apps/v1beta2, have a minimum of three release cycles prior to removal. It's unlikely that we can completely remove extensions/v1beta1 that quickly. That gives us at least nine months to ensure that, if we are able to promote a v1 API surface in that time, we have the correct features in place to support migration of the earliest adopters to that surface.

  3. I can think of a few other ways than making this purely immutable - for instance disallowing changes to both in a single update, or allowing both to be changed only if they select the previous set - but those would be slightly confusing. Can we list the cons of those in this issue for posterity (the other issue is too fragmentary now)?

The downside to allowing non-orphaning mutation is that it's very difficult to tell, based on the workload specification, which Pods have labels that correspond to which selectors.

In general, we don't have demand signals indicating whether opt-in mutability, or some version of restricted mutability, solves most of the use cases for users that are updating their selectors and labels. If we invest in the wrong feature at a stability level other than alpha, we're on the hook for supporting it, and it may not meet the needs of our users.

  4. If there is a valid workaround to immutability (delete then create), we should make sure our tools support it sufficiently not to break the workflow (i.e. anyone in 2 needs a way to move forward)

As @liggitt points out, delete-create is a disruptive approach, but we could support create-delete (blue/green) for users that are willing to spend some core hours for the additional availability during the update. In general, if we have strong demand for promotable canaries or orchestrated, non-orphaning selector mutation and Pod relabeling, we should implement it. If users need opt-in mutability or restricted mutability, we should implement one of those.

  5. Preventing updates of Deployment selectors doesn't prevent mutation of the labels of replica sets, so we're not completely closing the hole, although modifying replica sets does seem like more of a footgun.

Concur, but blocking label updates to ReplicaSets or ControllerRevisions is almost certainly too restrictive.

  6. We don't seem to have a clear idea of how we would fix this in the long run, which worries me slightly.

We have options to fix this in a consistent way across the apps/v1beta2 surface. By starting with immutable selectors, we are only reserving the right to defer implementation until we have stronger signals about the feature set that will provide the most value to users, while ensuring the default behavior is sane.

  7. Silently dropping mutations worries me a lot, because that means people who depended on this behavior (2) are now silently broken, which doesn't seem excusable.

As @enisoc suggests, we can, and should, implement this without silently dropping the mutation. Returning a validation error is the better choice.

@smarterclayton
Contributor

I'm mostly convinced by the arguments above. I agree that getting a strong signal on mutability would be important, and we can only pick immutability once. We need to communicate this change very widely if we do, and make sure no one in production is surprised.

starlingx-github pushed a commit to starlingx/ansible-playbooks that referenced this issue Sep 8, 2021
When performing a k8s upgrade from 1.18.1 to 1.19.13, the
networking images will be upgraded as well.  As part of this
upgrade, the respective kubernetes spec templates which
align with the new images will be used.

An issue has been seen in which the rolling upgrade of the
SR-IOV daemonsets fails because the latest templates specify
'name' as the matchLabel selector. However, the older spec
uses 'app' and 'tier' as the match selector.

I believe that as of apps/v1, the selector label is still
immutable for controllers such as daemonsets that have
already been deployed.

Some discussion on the topic can be found here:

kubernetes/kubernetes#50808

For now, we'll just carry forward the 1.18 match selector.
It's possible this can be fixed in later k8s releases.

Story: 2008972
Closes-Bug: 1942351

Signed-off-by: Steven Webster <steven.webster@windriver.com>
Change-Id: Id0ca32038dc2897879786a17f9794515457cd837
fghaas added a commit to fghaas/tutor that referenced this issue Nov 22, 2021
…n Tutor version changes

Through the commonLabels directive in kustomization.yml, all resources
get a label named "app.kubernetes.io/version", which is being set to
the Tutor version at the time of initial deployment.

When the user then subsequently progresses to a new Tutor version,
Kubernetes attempts to update this label — but for Deployment,
ReplicaSet, and DaemonSet resources, this is no longer allowed as of
kubernetes/kubernetes#50808. This causes
"tutor k8s start" (at the "kubectl apply --kustomize" step) to break
with errors such as:

Deployment.apps "redis" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/instance":"openedx-JIONBLbtByCGUYgHgr4tDWu1", "app.kubernetes.io/managed-by":"tutor", "app.kubernetes.io/name":"redis", "app.kubernetes.io/part-of":"openedx", "app.kubernetes.io/version":"12.1.7"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable

Simply removing the app.kubernetes.io/version label from
kustomization.yml will permanently fix this issue for newly created
Kubernetes deployments, which will "survive" any future Tutor version
changes thereafter.

However, *existing* production Open edX deployments will need to throw
the affected Deployments away, and re-create them.

Also, add the Tutor version as a resource annotation instead, using
the `commonAnnotations` directive.

See also:
kubernetes/client-go#508
https://kubectl.docs.kubernetes.io/references/kustomize/kustomization/commonlabels/
https://kubectl.docs.kubernetes.io/references/kustomize/kustomization/commonannotations/
fghaas added a commit to fghaas/tutor that referenced this issue Nov 22, 2021
fghaas added a commit to fghaas/tutor that referenced this issue Nov 23, 2021
regisb pushed a commit to overhangio/tutor that referenced this issue Nov 25, 2021
regisb pushed a commit to overhangio/tutor that referenced this issue Dec 20, 2021
regisb pushed a commit to overhangio/tutor that referenced this issue Dec 20, 2021
neptune-web added a commit to neptune-web/kubernetes-microservices-sam that referenced this issue Jun 12, 2022
In certain situations (see details below), the deployment to Kubernetes fails with:

> "The Deployment [DEPLOYMENT_OBJECT] is invalid: [...] `selector` does not match template `labels`".

This is caused by the K8S Deployment manifests missing an explicit `selector` value.

This commit:
* adds explicit `selector` values for all Deployment objects.
* bumps the K8S API from the deprecated `extensions/v1beta1` version to the stable `apps/v1` version. This version made the `selector` property of the Deployment a required value, preventing any issues with missing selectors in the future.

This change is backwards compatible with existing deployments of the microservices demo app. I.e. you should be able to pull this change and run `skaffold run` against an existing deployment of the app without issues.

This will not however resolve the issue for existing deployments. Selectors are immutable and will therefore retain their current defaulted value. You should run `skaffold delete` followed by `skaffold run` after having pulled this change to do a clean re-deployment of the app, which will resolve the issue.

**The nitty-gritty details**

In the `extensions/v1beta1` version of K8S API (the version that was used by this project), the `selector` property of a Deployment object is optional and is defaulted to the labels used in the pod template. This can cause subtle issues leading to deployment failures. This project, where Deployment selectors were omitted, is a good example of what can go wrong with defaulted selectors.

Consider this:

1. Run `skaffold run` to build locally with Docker and deploy.

Since the Deployment specs don't have explicit selectors, they will be defaulted to the pod template labels. And since skaffold adds additional labels to the pod template like `skaffold-builder` and `skaffold-deployer`, the end result will be a selector that looks like this:

> app=cartservice,cleanup=true,docker-api-version=1.39,skaffold-builder=local,skaffold-deployer=kubectl,skaffold-tag-policy=git-commit,tail=true

So far, so good.

2. Now run `skaffold run -p gcb --default-repo=your-gcr-repo` to build on Google Cloud Build instead of building locally.

This will blow up when attempting to deploy to Kubernetes with an error similar to:

> The Deployment "cartservice" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{"skaffold-builder":"google-cloud-build", "profiles"="gcb", "skaffold-deployer":"kubectl", "skaffold-tag-policy":"git-commit", "docker-api-version":"1.39", "tail":"true", "app":"cartservice", "cleanup":"true"}: `selector` does not match template `labels`

(and the same error for every other deployment object)

This is because the skaffold labels that were automatically added to the pod template have changed to include references to Google Cloud Build. That normally shouldn't be an issue.

But without explicit Deployment selectors, this results in the defaulted selectors for our Deployment objects to have also changed. Which means that the new version of our Deployment objects are now managing different sets of Pods. Which is thankfully caught by kubectl before the deployment happens (otherwise this would have resulted in orphaned pods).

In this commit, we explicitly set the `selector` value of all Deployment objects, which fixes this issue. We also bump the K8S API version to the stable `apps/v1`, which makes the `selector` property a required value and will avoid accidentally forgetting selectors in the future.

More details if you're curious:

* Why defaulted Deployment selectors cause problems: kubernetes/kubernetes#26202
* Why Deployment selectors should be (and were made) immutable: kubernetes/kubernetes#50808
oplik0 pushed a commit to oplik0/bso that referenced this issue Jun 6, 2024