Deployment rollout status not reliable when using status.conditions #124264

Open
sicavz opened this issue Apr 11, 2024 · 5 comments · May be fixed by #124558
Labels
kind/bug Categorizes issue or PR as related to a bug.
priority/backlog Higher priority than priority/awaiting-more-evidence.
sig/apps Categorizes an issue or PR as relevant to SIG Apps.
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@sicavz

sicavz commented Apr 11, 2024

What happened?

I am currently working on a project where we implement a Kubernetes operator, and in some flows we need to wait for certain Deployments to complete their rollout before advancing to the next steps of our process.
So, for an existing Deployment we only update a container image in the pod template and then wait for the updated Deployment to finish rolling out.

The way the Deployment status is built/implemented makes it error-prone and unreliable to judge the rollout based only on status.conditions (as suggested in the documentation).

What did you expect to happen?

After updating the spec of a deployment, when the status.ObservedGeneration is changed to the latest metadata.Generation, the status.conditions should also be updated.

How can we reproduce it (as minimally and precisely as possible)?

We found out that if:

  • we start with a Deployment whose status has a Progressing condition with reason ProgressDeadlineExceeded
  • we update the Deployment spec (the only change is a container image in the pod template)
  • on a timer (every second), we retrieve the Deployment through the API and check its status

Then, under some circumstances (so not always; it is a concurrency/timing issue), the first update that moves status.observedGeneration to the new metadata.generation value comes with no change whatsoever in status.conditions (the old conditions are still there, completely unchanged).
According to the Kubernetes documentation (and the way the Operator SDK checks rollout status), this state would be interpreted as a failed rollout, which is not accurate, because the actual rollout only starts a few seconds later... (see the sketch of that check below).
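
For reference, this is roughly the kind of check involved (loosely modeled on kubectl rollout status and the Operator SDK; the function name and exact structure below are only illustrative):

import appsv1 "k8s.io/api/apps/v1"

// rolloutState is an illustrative sketch of how the Deployment status is
// commonly interpreted, not the exact Operator SDK code.
func rolloutState(d *appsv1.Deployment) (completed, failed bool) {
    // A status that does not reflect the latest spec yet is simply "not done yet".
    if d.Status.ObservedGeneration < d.Generation {
        return false, false
    }
    // Once observedGeneration has caught up, a Progressing condition with reason
    // ProgressDeadlineExceeded is read as a failed rollout; this is exactly what
    // misfires when the stale conditions are still attached to the new generation.
    for _, c := range d.Status.Conditions {
        if c.Type == appsv1.DeploymentProgressing && c.Reason == "ProgressDeadlineExceeded" {
            return false, true
        }
    }
    // Otherwise the rollout is complete once every replica is updated and available.
    if d.Spec.Replicas != nil {
        completed = d.Status.UpdatedReplicas == *d.Spec.Replicas &&
            d.Status.AvailableReplicas == *d.Spec.Replicas &&
            d.Status.Replicas == *d.Spec.Replicas
    }
    return completed, false
}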

This is an example from a log describing the issue (note that the two blocks are logged only a few milliseconds apart and the status.conditions are identical even though status.ObservedGeneration changed):

# 10:53:18.814 > Metadata.Generation: 2 | Status >> ObservedGeneration: 1, replicas: 1 , readyReplicas: , availableReplicas: , updatedReplicas: 1, unAvailableReplicas: 1
  Conditions:
    @10-Apr-24 10:53:02 | 10-Apr-24 10:53:02 | Available | False | MinimumReplicasUnavailable | Deployment does not have minimum availability.
    @10-Apr-24 10:53:18 | 10-Apr-24 10:53:18 | Progressing | False | ProgressDeadlineExceeded | ReplicaSet "my-deployment-5b8dc498c" has timed out progressing.
 => completed: False, failed: False

# 10:53:18.818 > Metadata.Generation: 2 | Status >> ObservedGeneration: 2, replicas: 1 , readyReplicas: , availableReplicas: , updatedReplicas: , unAvailableReplicas: 1
  Conditions:
    @10-Apr-24 10:53:02 | 10-Apr-24 10:53:02 | Available | False | MinimumReplicasUnavailable | Deployment does not have minimum availability.
    @10-Apr-24 10:53:18 | 10-Apr-24 10:53:18 | Progressing | False | ProgressDeadlineExceeded | ReplicaSet "my-deployment-5b8dc498c" has timed out progressing.
 => completed: False, failed: True

Anything else we need to know?

Please advise on how to reliably interpret the rollout status of a Deployment.

Kubernetes version

$ kubectl version
Client Version: v1.28.8+k3s1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.8+k3s1

Cloud provider

no cloud

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="K3s v1.28.8+k3s1"


$ uname -a
 Linux 3432b46b9978 5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 GNU/Linux



Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@sicavz sicavz added the kind/bug Categorizes issue or PR as related to a bug. label Apr 11, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 11, 2024
@neolit123
Member

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 11, 2024
@harshap804

I am trying to fix some unit tests in my local environment to test a fix for this. This is what I am trying to do:

In the modified updateDeployment function, we first check whether the Deployment's spec has been updated by comparing the old and current specs. If the spec has changed, we log the update, copy the new spec over, set status.ObservedGeneration to the current metadata.Generation, and update the Deployment's status conditions based on the rollout progress. Finally, we enqueue the Deployment for further processing.

@harshap804

And this is how I am doing it:

func (dc *DeploymentController) updateDeployment(logger klog.Logger, old, cur interface{}) {
    oldD := old.(*apps.Deployment)
    curD := cur.(*apps.Deployment)
    // Only react when the spec actually changed.
    if !reflect.DeepEqual(oldD.Spec, curD.Spec) {
        logger.V(4).Info("Updating deployment", "deployment", klog.KObj(oldD))
        // Copy the new spec and immediately reflect the new generation in the status.
        oldD.Spec = curD.Spec
        oldD.Status.ObservedGeneration = curD.ObjectMeta.Generation
        // Rewrite the Available condition based on the current rollout progress
        // (guard against a nil Replicas pointer before dereferencing it).
        if oldD.Spec.Replicas != nil && oldD.Status.UpdatedReplicas == *oldD.Spec.Replicas {
            oldD.Status.Conditions = []apps.DeploymentCondition{
                {
                    Type:   apps.DeploymentAvailable,
                    Status: corev1.ConditionTrue,
                },
            }
        } else {
            oldD.Status.Conditions = []apps.DeploymentCondition{
                {
                    Type:   apps.DeploymentAvailable,
                    Status: corev1.ConditionFalse,
                },
            }
        }
        dc.enqueueDeployment(oldD)
    }
}

@atiratree
Member

/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 26, 2024
@atiratree
Member

I can only reproduce this if the ReplicaSet with the new image existed before. If the new ReplicaSet does not exist, the progress is indicated immediately.

@sicavz
Does this happen only in this case for you? Can you share a reproducer or your deployment?

I think #124558 might fix it. Can you check if it helps?

@harshap804 FYI, it is not a good idea to mutate the shared cache in the event handlers.
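
For context, the usual informer event-handler pattern looks roughly like the sketch below (simplified, not the exact controller code): the handler only compares and enqueues, and any status computation happens later in the sync loop on a DeepCopy of the cached object.

// Simplified sketch: old and cur point into the shared informer cache, so the
// handler must not modify them; it only decides whether to enqueue the object.
func (dc *DeploymentController) updateDeployment(logger klog.Logger, old, cur interface{}) {
    oldD := old.(*apps.Deployment)
    curD := cur.(*apps.Deployment)
    if oldD.ResourceVersion == curD.ResourceVersion {
        // Periodic resync; nothing actually changed.
        return
    }
    logger.V(4).Info("Updating deployment", "deployment", klog.KObj(curD))
    dc.enqueueDeployment(curD)
    // Any status update belongs in the sync handler, working on d.DeepCopy()
    // and persisting the result through the API, never on the cached object.
}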
