Deployment rollout status not reliable when using status.conditions #124264

Open
sicavz opened this issue Apr 11, 2024 · 5 comments · May be fixed by #124558
Labels
kind/bug Categorizes issue or PR as related to a bug.
priority/backlog Higher priority than priority/awaiting-more-evidence.
sig/apps Categorizes an issue or PR as relevant to SIG Apps.
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@sicavz

sicavz commented Apr 11, 2024

What happened?

I am currently working on a project where we implement a Kubernetes operator, and in some flows we need to wait for certain Deployments to complete their rollout before advancing to the next steps of our process.
So, for an existing Deployment we only update a container image in the pod template and then wait for the updated Deployment to finish rolling out.

The way the Deployment status is built/implemented makes it error-prone and unreliable to judge the rollout based only on status.conditions (as suggested in the documentation).

What did you expect to happen?

After updating the spec of a deployment, when the status.ObservedGeneration is changed to the latest metadata.Generation, the status.conditions should also be updated.

How can we reproduce it (as minimally and precisely as possible)?

We found out that if:

  • we start with a Deployment whose status has a Progressing condition with reason ProgressDeadlineExceeded
  • we update the Deployment spec (the only change is a container image in the pod template)
  • on a timer (every second), we retrieve the Deployment through the API and check its status

Then, under some circumstances (so not always; it is a concurrency/timing issue), the first update that moves status.observedGeneration to the new metadata.generation value comes with no change whatsoever in status.conditions (the old conditions are still there, completely unchanged).
According to the Kubernetes documentation (and the way the Operator SDK checks rollout status), this state would be interpreted as a failed rollout, which is not accurate, because the actual rollout only starts a few seconds later... (see the sketch of that check below).
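
For reference, this is roughly the kind of check involved (loosely modeled on kubectl rollout status and the Operator SDK; the function name and exact structure below are only illustrative):

import appsv1 "k8s.io/api/apps/v1"

// rolloutState is an illustrative sketch of how the Deployment status is
// commonly interpreted, not the exact Operator SDK code.
func rolloutState(d *appsv1.Deployment) (completed, failed bool) {
    // A status that does not reflect the latest spec yet is simply "not done yet".
    if d.Status.ObservedGeneration < d.Generation {
        return false, false
    }
    // Once observedGeneration has caught up, a Progressing condition with reason
    // ProgressDeadlineExceeded is read as a failed rollout; this is exactly what
    // misfires when the stale conditions are still attached to the new generation.
    for _, c := range d.Status.Conditions {
        if c.Type == appsv1.DeploymentProgressing && c.Reason == "ProgressDeadlineExceeded" {
            return false, true
        }
    }
    // Otherwise the rollout is complete once every replica is updated and available.
    if d.Spec.Replicas != nil {
        completed = d.Status.UpdatedReplicas == *d.Spec.Replicas &&
            d.Status.AvailableReplicas == *d.Spec.Replicas &&
            d.Status.Replicas == *d.Spec.Replicas
    }
    return completed, false
}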

This is an example from a log describing the issue (note that the two blocks are logged only a few milliseconds apart and the status.conditions are identical even though status.ObservedGeneration changed):

# 10:53:18.814 > Metadata.Generation: 2 | Status >> ObservedGeneration: 1, replicas: 1 , readyReplicas: , availableReplicas: , updatedReplicas: 1, unAvailableReplicas: 1
  Conditions:
    @10-Apr-24 10:53:02 | 10-Apr-24 10:53:02 | Available | False | MinimumReplicasUnavailable | Deployment does not have minimum availability.
    @10-Apr-24 10:53:18 | 10-Apr-24 10:53:18 | Progressing | False | ProgressDeadlineExceeded | ReplicaSet "my-deployment-5b8dc498c" has timed out progressing.
 => completed: False, failed: False

# 10:53:18.818 > Metadata.Generation: 2 | Status >> ObservedGeneration: 2, replicas: 1 , readyReplicas: , availableReplicas: , updatedReplicas: , unAvailableReplicas: 1
  Conditions:
    @10-Apr-24 10:53:02 | 10-Apr-24 10:53:02 | Available | False | MinimumReplicasUnavailable | Deployment does not have minimum availability.
    @10-Apr-24 10:53:18 | 10-Apr-24 10:53:18 | Progressing | False | ProgressDeadlineExceeded | ReplicaSet "my-deployment-5b8dc498c" has timed out progressing.
 => completed: False, failed: True

Anything else we need to know?

Please advise on how to reliably interpret the rollout status of a Deployment.

Kubernetes version

$ kubectl version
Client Version: v1.28.8+k3s1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.8+k3s1

Cloud provider

no cloud

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="K3s v1.28.8+k3s1"


$ uname -a
 Linux 3432b46b9978 5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 GNU/Linux



Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@sicavz sicavz added the kind/bug Categorizes issue or PR as related to a bug. label Apr 11, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 11, 2024
@neolit123
Member

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 11, 2024
@harshap804

I am trying to fix some unit tests in my local environment to test a fix for this. This is what I am trying to do:

In the modified updateDeployment function, we first check whether the Deployment's spec has been updated by comparing the old and current specs. If the spec has changed, we log the update, copy the new spec over, set status.ObservedGeneration to the current metadata.Generation, and update the Deployment's status conditions based on the rollout progress. Finally, we enqueue the Deployment for further processing.

@harshap804

And this is how I am doing it:

func (dc *DeploymentController) updateDeployment(logger klog.Logger, old, cur interface{}) {
    oldD := old.(*apps.Deployment)
    curD := cur.(*apps.Deployment)
    // Only react when the spec actually changed.
    if !reflect.DeepEqual(oldD.Spec, curD.Spec) {
        logger.V(4).Info("Updating deployment", "deployment", klog.KObj(oldD))
        // Copy the new spec and immediately reflect the new generation in the status.
        oldD.Spec = curD.Spec
        oldD.Status.ObservedGeneration = curD.ObjectMeta.Generation
        // Rewrite the Available condition based on the current rollout progress
        // (guard against a nil Replicas pointer before dereferencing it).
        if oldD.Spec.Replicas != nil && oldD.Status.UpdatedReplicas == *oldD.Spec.Replicas {
            oldD.Status.Conditions = []apps.DeploymentCondition{
                {
                    Type:   apps.DeploymentAvailable,
                    Status: corev1.ConditionTrue,
                },
            }
        } else {
            oldD.Status.Conditions = []apps.DeploymentCondition{
                {
                    Type:   apps.DeploymentAvailable,
                    Status: corev1.ConditionFalse,
                },
            }
        }
        dc.enqueueDeployment(oldD)
    }
}

@atiratree
Member

/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 26, 2024
@atiratree
Member

I can only reproduce this if the ReplicaSet with the new image existed before. If the new ReplicaSet does not exist, the progress is indicated immediately.

@sicavz
Does this happen only in this case for you? Can you share a reproducer or your deployment?

I think #124558 might fix it. Can you check if it helps?

@harshap804 FYI, it is not a good idea to mutate the shared cache in the event handlers.
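
For context, the usual informer event-handler pattern looks roughly like the sketch below (simplified, not the exact controller code): the handler only compares and enqueues, and any status computation happens later in the sync loop on a DeepCopy of the cached object.

// Simplified sketch: old and cur point into the shared informer cache, so the
// handler must not modify them; it only decides whether to enqueue the object.
func (dc *DeploymentController) updateDeployment(logger klog.Logger, old, cur interface{}) {
    oldD := old.(*apps.Deployment)
    curD := cur.(*apps.Deployment)
    if oldD.ResourceVersion == curD.ResourceVersion {
        // Periodic resync; nothing actually changed.
        return
    }
    logger.V(4).Info("Updating deployment", "deployment", klog.KObj(curD))
    dc.enqueueDeployment(curD)
    // Any status update belongs in the sync handler, working on d.DeepCopy()
    // and persisting the result through the API, never on the cached object.
}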
