
Rolling update of deployments creates MAX number of pods X 2 according to HPA, regardless of current load #84142

Closed
paalkr opened this issue Oct 21, 2019 · 30 comments · Fixed by #85027
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@paalkr

paalkr commented Oct 21, 2019

What happened:
The bug described in #72775 is still present in 1.14.8.
When a change is made to a running deployment that requires a rolling update, both the current ReplicaSet and the new ReplicaSet are scaled out to the maximum number of pods, regardless of actual load.

What you expected to happen:
The bug to be fixed
#79035
#79708

How to reproduce it (as minimally and precisely as possible):
Follow this guide https://gist.github.com/skaji/575e4a383fac0f0d11b5b51220a6049c
The cluster does NOT have to be an EKS cluster, the bug is not related to the underlying platform or infrastructure.

Anything else we need to know?:
This bug is really critical and should be addressed and hotfixed as soon as possible.

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"windows/amd64"} Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.8", GitCommit:"211047e9a1922595eaa3a1127ed365e9299a6c23", GitTreeState:"clean", BuildDate:"2019-10-15T12:02:12Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS EC2
  • OS (e.g: cat /etc/os-release): Container Linux by CoreOS 2247.5.0 (Rhyolite)
  • Kernel (e.g. uname -a): 4.19.78-coreos
  • Install tools: kube-aws v.0.14.2 (https://github.com/kubernetes-incubator/kube-aws)
@paalkr paalkr added the kind/bug Categorizes issue or PR as related to a bug. label Oct 21, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 21, 2019
@paalkr
Author

paalkr commented Oct 21, 2019

@kubernetes/sig-autoscalling-bugs

@paalkr
Author

paalkr commented Oct 22, 2019

This bug only applies to the autoscaling/v2beta2 and autoscaling/v2beta1 API versions, not to the autoscaling/v1 API.

@thdrnsdk
Contributor

Same issue here. I think this is a really critical issue, so I hope it gets hotfixed as soon as possible.

@neolit123
Member

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 24, 2019
@JoelSpeed
Contributor

I've been looking into this and have made some interesting observations about the behaviour within our systems at least:

Firstly, during the update, we see `New size: x; reason: All metrics below target` even though x is greater than the current replica count, and this continues to increase over time. This suggests the bug may be somewhere in here:

if desiredReplicas < currentReplicas {
	rescaleReason = "All metrics below target"
}
desiredReplicas = a.normalizeDesiredReplicas(hpa, key, currentReplicas, desiredReplicas)

currentReplicas is set to scale.Spec.Replicas and so I believe desiredReplicas is being increased (to greater than currentReplicas) by the normalizeDesiredReplicas method.
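
For what it's worth, one plausible mechanism is the downscale stabilization that normalizeDesiredReplicas layers on top of the raw recommendation: recent recommendations are remembered and the highest one still inside the stabilization window wins, so an inflated recommendation produced while pods were missing metrics can keep desiredReplicas above currentReplicas for several minutes. A minimal sketch of that idea (my own simplification, not the upstream code):

```go
// Simplified sketch of downscale stabilization (not the upstream code):
// the controller keeps recent recommendations and uses the highest one
// still inside the stabilization window, so an inflated recommendation
// recorded during the rollout wins over the lower, "correct" one until
// the window expires.
package main

import (
	"fmt"
	"time"
)

type recommendation struct {
	replicas int32
	at       time.Time
}

// stabilize returns the highest recommendation seen within the window,
// including the latest one.
func stabilize(history []recommendation, latest int32, window time.Duration, now time.Time) int32 {
	max := latest
	for _, r := range history {
		if now.Sub(r.at) <= window && r.replicas > max {
			max = r.replicas
		}
	}
	return max
}

func main() {
	now := time.Now()
	history := []recommendation{
		{replicas: 6, at: now.Add(-2 * time.Minute)}, // inflated while Pod metrics were missing
	}
	// Metrics now justify only 1 replica, but the 5-minute window keeps 6.
	fmt.Println(stabilize(history, 1, 5*time.Minute, now)) // prints 6
}
```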

Secondly, we see the scale doubling (approximately) during the rolling update. The observation I've made is that the scale seems to be getting updated to the total pod count of the old and new ReplicaSets combined, as if the number of pods matching the deployment's label selector were being used as the desired count when the scale is finally updated. I have yet to find a piece of code that verifies this, though; it is merely an observation.

Hopefully this helps narrow down the bug; it may also be a wild goose chase 😅 I can do more testing if anyone has any ideas, just let me know.

@qoobic

qoobic commented Nov 7, 2019

Just for information, I am seeing similar behaviour using a custom metric, not CPU - I am using sum(rate(http_server_requests_seconds_count[5m])). In our dev environment the config is a min of 1 and a max of 3. Even when there are no requests coming in, the HPA is scaling to 3 or 4 pods, then cooling back down to 3 pods after 5 minutes.

@shibataka000
Contributor

shibataka000 commented Nov 9, 2019

In Step 3

  • CPU Utilization of Pod-1 is 1 / 100 = 0.01.
  • CPU Utilization of Pod-2 is 1 / 100 = 0.01.

So CPU Utilization of Deployment is (1 + 1) / (100 + 100) = 0.01.

replicaCount = ceil(readyPodCount * CurrentResourceUtilization / desiredResourceUtilization) = ceil(2 * 0.01 / 0.5) = 1

So the calculated number of replicas is 1. But minReplicas is 2, so, as a result, the new number of replicas is 2.

In Step 5

New Pod-3 is created due to RollingUpdate with maxSurge: 1 and ~~minUnavailable~~ maxUnavailable: 0.

  • CPU Utilization of Pod-1 is 1 / 100 = 0.01.
  • CPU Utilization of Pod-2 is 1 / 100 = 0.01.
  • CPU Utilization of Pod-3 is 100 / 100 = 1.0, because metrics-server can't yet provide metrics for the freshly created Pod-3, and the HPA controller treats a pod's resource utilization as 1.0 when the deployment tends to scale in.

So CPU Utilization of Deployment is (1 + 1 + 100) / (100 + 100 + 100) = 0.34.

replicaCount = ceil(len(metrics) * CurrentResourceUtilization / desiredResourceUtilization) = ceil(3 * 0.34 / 0.5) = 3

So the new number of replicas is 3, in spite of CurrentResourceUtilization / desiredResourceUtilization < 1.0.
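
Putting the two steps together, here is a minimal sketch of the calculation (my own simplification with an assumed 50% target, where a pod with missing metrics is counted at 100% utilization; not the upstream calcPlainMetricReplicas code):

```go
// Minimal sketch of the replica calculation described above (not the
// upstream calcPlainMetricReplicas code). A pod whose metrics are not
// yet available is counted at 1.0 (100% of its request), which is what
// inflates the recommendation during a rolling update.
package main

import (
	"fmt"
	"math"
)

// desiredReplicas returns ceil(podCount * currentUtilization / targetUtilization).
func desiredReplicas(utilizations []float64, target float64) int {
	sum := 0.0
	for _, u := range utilizations {
		sum += u
	}
	current := sum / float64(len(utilizations))
	return int(math.Ceil(float64(len(utilizations)) * current / target))
}

func main() {
	target := 0.5 // assumed 50% CPU utilization target

	// Step 3: two idle pods, both reporting metrics.
	fmt.Println(desiredReplicas([]float64{0.01, 0.01}, target)) // 1, then clamped up to minReplicas=2

	// Step 5: Pod-3 was just created by the rolling update and has no
	// metrics yet, so it is counted as 1.0.
	fmt.Println(desiredReplicas([]float64{0.01, 0.01, 1.0}, target)) // 3
}
```

With the Step 5 inputs this reproduces the jump to 3 replicas even though the real utilization is far below target.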

@shibataka000
Contributor

I'd like to fix this issue in #85027 .

@max-rocket-internet

@shibataka000 as a workaround, can we just set minUnavailable: 1 to solve it?

@paalkr
Author

paalkr commented Nov 11, 2019

@shibataka000, just to make sure this is not forgotten:
Specifying maxUnavailable: 0 also results in an uncontrolled rollout. Please verify that the fix in #85027 applies to the maxSurge: 1 and maxUnavailable: 0 scenario as well.

Or is your #84142 (comment) about minUnavailable: 0 just a typo? Your PR actually specifies maxUnavailable: 0, which is what I reported in this bug.

> In Step 5
> New Pod-3 is created due to RollingUpdate with maxSurge: 1 and minUnavailable: 0.

@paalkr
Author

paalkr commented Nov 11, 2019

@max-rocket-internet, if I'm not mistaken this bug is related to maxUnavailable: 0, not minUnavailable: 0, so specifying minUnavailable: 1 won't help.

Edit: There is actually no such thing as minUnavailable, which wouldn't make any logical sense either:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.16/#deploymentstrategy-v1-apps

@max-rocket-internet

Sorry, I meant maxUnavailable: 0 but copy pasted the incorrect text. I think minUnavailable is simply a typo.

So, as a workaround, can we just set maxUnavailable: 1 to solve it?

@paalkr
Author

paalkr commented Nov 11, 2019

Unfortunately, maxUnavailable: 1 is really not a good workaround. Say you have a deployment with 200 pods: maxUnavailable: 1 would slow down the rolling update dramatically, as only one new pod could be created at a time instead of all 200 at once.

If the startup time for your pod is 10 seconds, replacing all 200 pods would take 10 seconds with maxUnavailable: 0, but 2000 seconds (33.3 minutes) with maxUnavailable: 1.

And according to the documentation, setting maxUnavailable to anything other than 0 will reduce the number of pods (basically reducing capacity) during a rolling update. We cannot tolerate any reduced capacity during a rolling update, as the service powered by the pods might get saturated with incoming customer requests.

> Example: when this is set to 30%, the old ReplicaSet can be scaled down to 70% of desired pods immediately when the rolling update starts. Once new pods are ready, old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of pods available at all times during the update is at least 70% of desired pods.

My understanding is that the maxUnavailable setting exists to ensure that your cluster (worker nodes) is not saturated during a rolling update. In our case we have enough free capacity in the cluster to always be able to immediately replace all pods of any running service.

@shibataka000
Contributor

@paalkr @max-rocket-internet I'm sorry, minUnavailable is a typo. maxUnavailable is correct.

@shibataka000
Contributor

@max-rocket-internet

> as a workaround, can we just set maxUnavailable: 1 to solve it?

Unfortunately, no. When I try the scenario described in https://gist.github.com/skaji/575e4a383fac0f0d11b5b51220a6049c with maxUnavailable: 1, the HPA still scales the deployment regardless of current load; len(metrics) becomes 3 depending on timing.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 9, 2020
@paalkr
Author

paalkr commented Feb 9, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 9, 2020
@paalkr
Author

paalkr commented Feb 9, 2020

What is the status of this issue? I still believe it's a critical bug that needs to be addressed.

@caiohasouza

+1

@danielgblanco

We may be running into the same problem, or one that is connected to this. We run a deployment with

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1

The HPA, which targets the deployment, is set to scale on a custom metric which is scraped every 30 seconds (as the value is calculated as a rate, it takes 60 seconds for the first value to appear).

When the rolling update runs, even with a completely idle deployment at 1 replica, the HPA starts to add replicas to the deployment until it either hits the max replicas (8 for this deployment) or the first Pod generates a metric. The calcPlainMetricReplicas method works as intended: it uses 100% of the target value for pods missing metrics, so, with 2 replicas running, one at 100% and the other at 1%, it still returns 2 replicas as the desired count.

I believe the problem is related to how the current number of replicas is taken from the deployment spec during HPA reconciliation (code). During a rolling update, we saw the deployment spec and the deployment status being out of sync for long periods of time (over 3 minutes).

$ while sleep 1; do
>     date
>     kubectl get deployment my-deployment -o=jsonpath='{.spec.replicas} {.status.replicas}'
>     echo
> done

Mon  2 Mar 2020 14:42:26 GMT
1 1
[...]
Mon  2 Mar 2020 14:43:22 GMT
1 1
Mon  2 Mar 2020 14:43:23 GMT
1 2
[...]
Mon  2 Mar 2020 14:43:28 GMT
2 3
[...]
Mon  2 Mar 2020 14:43:42 GMT
3 4
[...]
Mon  2 Mar 2020 14:43:57 GMT
4 5
[...]
Mon  2 Mar 2020 14:44:13 GMT
5 6
[...]
Mon  2 Mar 2020 14:44:27 GMT
6 7
[...]
Mon  2 Mar 2020 14:47:50 GMT
6 6
[...]
Mon  2 Mar 2020 14:49:28 GMT
5 5
[...]
Mon  2 Mar 2020 14:49:43 GMT
4 4
[...]
Mon  2 Mar 2020 14:49:58 GMT
3 3
[...]
Mon  2 Mar 2020 14:52:28 GMT
2 2
[...]
Mon  2 Mar 2020 14:52:58 GMT
1 1

This does not always happen, but it happens most of the times we do a rolling update. The HPA status is also confusing, as it indicates that the metric is above target when in fact it is well below target:

  Type    Reason             Age                 From                       Message
  ----    ------             ----                ----                       -------
  Normal  SuccessfulRescale  2m44s (x3 over 5d)  horizontal-pod-autoscaler  New size: 2; reason: pods metric my_custom_metric above target
  Normal  SuccessfulRescale  2m29s (x3 over 5d)  horizontal-pod-autoscaler  New size: 3; reason: pods metric my_custom_metric above target
  Normal  SuccessfulRescale  2m14s (x3 over 5d)  horizontal-pod-autoscaler  New size: 4; reason: pods metric my_custom_metric above target
  Normal  SuccessfulRescale  119s (x3 over 5d)   horizontal-pod-autoscaler  New size: 5; reason: pods metric my_custom_metric above target
  Normal  SuccessfulRescale  104s                horizontal-pod-autoscaler  New size: 6; reason: pods metric my_custom_metric above target

Should the HPA reconciliation use the current number of replicas from the status rather than from the spec?
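
For reference, the same spec-versus-status divergence can be watched programmatically; here is a small client-go sketch (assuming a recent client-go where Get takes a context, a kubeconfig at the default location, and a deployment named my-deployment in the default namespace):

```go
// Illustrative client-go sketch: print a deployment's spec vs status
// replica counts once per second, mirroring the kubectl loop above.
// Assumes a kubeconfig at the default location and a deployment named
// "my-deployment" in the "default" namespace.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	for {
		d, err := clientset.AppsV1().Deployments("default").Get(context.TODO(), "my-deployment", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		// .spec.replicas is what the HPA controller reads via the scale
		// subresource; .status.replicas counts pods across the old and new
		// ReplicaSets, so the two diverge during a rolling update.
		fmt.Printf("%s spec=%d status=%d\n",
			time.Now().Format(time.RFC3339), *d.Spec.Replicas, d.Status.Replicas)
		time.Sleep(time.Second)
	}
}
```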

@josephburnett
Member

@paalkr thanks for the report. I am able to reproduce this on 1.14.10 with your test yaml.

I will find the root cause and come back to you shortly.

@josephburnett
Member

/assign josephburnett

@bgpat

bgpat commented Mar 19, 2020

This issue reproduces in my environment (v1.15.9) too, and it's resolved by replacing the PodsMetricSource metric type with ObjectMetricSource. I suspect the formula calculating the average is wrong.

@josephburnett
Member

I've cherry-picked #85027 on top of 1.14.8 and it indeed fixes the bug. I've approved the PR and requested that it be fixed in calcPlainMetricReplicas as well.

@paalkr
Author

paalkr commented Mar 24, 2020

@josephburnett, thanks a lot! So this bug will be fixed in the next releases of 1.14, 1.15, 1.16, and 1.17?

@gjtempleton
Member

Unless I'm mistaken, this won't be cherry-picked back to 1.14, as it has fallen out of support as of the GA release of 1.17, per the support policy.

At best it'll make it back to 1.15.

@josephburnett
Member

I'll work on backporting this to 1.15 - 1.18.

@josephburnett
Member

Cherry pick process (for reference): https://github.com/kubernetes/community/blob/83af19fd21fe3af7cd88630e7239b974091645d2/contributors/devel/sig-release/cherry-picks.md#cherry-pick-review

1.15: #89514
1.16: #89515
1.17: #89516
1.18: #89517

@shyamjos

shyamjos commented May 2, 2020

This bug is affecting us badly. Is this fix merged in a 1.16.x release? If not, is there a temporary solution for this issue?

@djjayeeta

@shyamjos
This fix is available from 1.16.9.

chenchun pushed a commit to chenchun/kubernetes that referenced this issue Mar 20, 2024
Fix bug about unintentional scale out during updating deployment.
**MR description / purpose**:
kubernetes#89465

**Related issue**:
Fixes kubernetes#84142

**Code review notes**:

**Does this MR affect users?**:

```release-note

```