Rolling update of deployments creates MAX number of pods X 2 according to HPA, regardless of current load #84142
Comments
@kubernetes/sig-autoscaling-bugs
This bug only applies to the …
Same issue.
/sig autoscaling
I've been looking into this and have made some interesting observations about the behaviour within our systems, at least. Firstly, during the update, we see kubernetes/pkg/controller/podautoscaler/horizontal.go, lines 624 to 627 (at commit 4485c6f).
Secondly, we see the scale doubling (approximately) during the rolling update. The observation I've made is that the scale seems to be getting updated to the total pod count of the old and new ReplicaSets combined, as if the controller is using the number of pods that match the deployment's label selector as the desired count when it finally updates the scale. I have yet to find a piece of code that verifies this, though; it is merely an observation. Hopefully this helps narrow down the bug, though it may also be a wild goose chase 😅 I can do more testing if anyone has any ideas, just let me know.
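To make that observation concrete, here is a worked example with hypothetical numbers (100 steady-state pods plus 100 surged pods), using the scaling rule from the Kubernetes HPA documentation:

```latex
% HPA scaling rule from the Kubernetes documentation:
\mathrm{desiredReplicas} = \left\lceil \mathrm{currentReplicas} \times \frac{\mathrm{currentMetricValue}}{\mathrm{desiredMetricValue}} \right\rceil
% If the controller counts every pod matching the selector, then during a
% surge currentReplicas = 100 + 100 = 200. With the metric sitting at
% target (ratio = 1), desiredReplicas = ceil(200 * 1) = 200, roughly
% double the steady-state count, matching the doubling observed above.
```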
Just for information, I am seeing similar behaviour using a custom metric, not CPU. I am using …
I'd like to fix this issue in #85027.
@shibataka000 As a workaround, can we just set …
@shibataka000, just to make sure this is not forgotten. Or is your #84142 (comment) about …
@max-rocket-internet, if I'm not mistaken this bug is related to …

Edit: There is actually no such thing as a …
Sorry, I meant … So, as a workaround, can we just set …
Unfortunately, no. If the startup time for your pod is 10 seconds, replacing all 200 pods would take 10 seconds with …

And according to the documentation, setting `maxUnavailable`:

> Example: when this is set to 30%, the old ReplicaSet can be scaled down to 70% of desired pods immediately when the rolling update starts. Once new pods are ready, the old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of pods available at all times during the update is at least 70% of desired pods.

My understanding is that the …
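For reference, here is a minimal sketch of the Deployment strategy fields being discussed; the name and values are illustrative, not a recommendation from this thread:

```yaml
# Sketch of the rolling-update knobs discussed above (illustrative values).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example            # hypothetical name
spec:
  replicas: 200
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0          # no extra pods are created above 'replicas'
      maxUnavailable: 30%  # up to 30% of pods may be unavailable during the update
  # selector and pod template omitted for brevity
```

With `maxSurge: 0` the update never exceeds the desired replica count, at the cost of reduced capacity while old pods are being replaced.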
@paalkr @max-rocket-internet I'm sorry.
Unfortunately, no. When I try the scenario described in https://gist.github.com/skaji/575e4a383fac0f0d11b5b51220a6049c with …
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
What is the status of this issue? I still believe it's a critical bug that needs to be addressed.
+1
We may be running into the same problem, or one that is connected to this. We run a deployment with …
The HPA, which targets the deployment, is set to scale on a custom metric which is scraped every 30 seconds (as the value is calculated as a …).

When the rolling update runs, even with a completely idle deployment at 1 replica, the HPA starts to add replicas to the deployment until it either hits the max replicas (which is 8 for this deployment) or until the first Pod generates a metric.

The missing-metrics handling in the calcPlainMetricReplicas method works as intended: it uses 100% of the target value for pods missing metrics, so, with 2 replicas running, one at 100% and the other at 1%, it still returns 2 replicas as the desired replica count (see the worked example below).

I believe the problem is related to how the current number of replicas is taken from the deployment spec during HPA reconciliation (code). During a rolling update, we saw the deployment spec and the deployment status being out of sync for long periods of time (over 3 minutes).
This does not always happen, but it happens most of the times we do a rolling update. The HPA status is also confusing, as it indicates that the metric is above target when in fact it is well below target: …
Should the HPA reconciliation use the current number of replicas from the status rather than from the spec?
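To make the arithmetic in the comment above concrete (the numbers are illustrative): with a per-pod target of 100, one pod reporting 1, and one pod missing its metric and therefore assumed to be at 100% of the target, the average stays below target and no scale-up should happen:

```latex
% One reported value (1) plus one missing metric assumed at target (100):
\mathrm{averageUsage} = \frac{100 + 1}{2} = 50.5
% Applying the scaling rule with currentReplicas = 2 and target = 100:
\mathrm{desiredReplicas} = \left\lceil 2 \times \frac{50.5}{100} \right\rceil = \lceil 1.01 \rceil = 2
```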
@paalkr thanks for the report. I am able to reproduce this on 1.14.10 with your test yaml. I will find the root cause and come back to you shortly.
/assign josephburnett
This issue is reproduced in my environment (v1.15.9) too, and it is resolved by changing the metric type from PodsMetricSource to ObjectMetricSource. I suspect the formula calculating the average is wrong.
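As a sketch of that workaround (the metric name and described object are hypothetical, using the autoscaling/v2beta2 shapes), the change amounts to swapping the HPA metric source type:

```yaml
# Affected form reported above: a Pods metric source, averaged across the
# pods matched by the target's selector (both ReplicaSets during a rollout).
metrics:
- type: Pods
  pods:
    metric:
      name: queue_length          # hypothetical metric name
    target:
      type: AverageValue
      averageValue: "100"
---
# Workaround reported above: an Object metric source, read from one object
# and compared as a single value, so no per-pod averaging is involved.
metrics:
- type: Object
  object:
    describedObject:
      apiVersion: v1
      kind: Service
      name: example-service       # hypothetical object
    metric:
      name: queue_length
    target:
      type: Value
      value: "100"
```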
I've cherry-picked #85027 on top of 1.14.8 and it indeed fixes the bug. I've approved the PR and requested to fix it in …
@josephburnett, thx a lot! So this bug will be fixed in the next releases of 1.14, 1.15, 1.16, and 1.17?
Unless I'm mistaken, this won't be cherry-picked back to 1.14, as it has fallen out of support as of the GA release of 1.17, per the support policy. At best it'll be backported to 1.15.
I'll work on backporting this to 1.15–1.18.
This bug is affecting us badly. Is this fix merged in a 1.16.x release? If not, is there a temporary solution to this issue?
@shyamjos …
Fix bug about unintentional scale-out during deployment updates.

**MR description / purpose**: kubernetes#89465

**Related issue**: Fixes kubernetes#84142

**Notes for code review**:

**Does this MR affect users?**:

```release-note
```
What happened:
The bug described in #72775 is still present in 1.14.8
When a change is made to the running deployment that requires a rolling update to be executed, both the current ReplicaSet and the new ReplicaSet are scaled out to the max number of pods, regardless of actual load.
What you expected to happen:
The bug to be fixed
#79035
#79708
How to reproduce it (as minimally and precisely as possible):
Follow this guide https://gist.github.com/skaji/575e4a383fac0f0d11b5b51220a6049c
The cluster does NOT have to be an EKS cluster; the bug is not related to the underlying platform or infrastructure.
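In outline, the setup the gist exercises looks roughly like this; the names and values are hypothetical, and the actual manifests are in the gist above:

```yaml
# Sketch only; see the linked gist for the real reproduction manifests.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa         # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example           # hypothetical, mostly idle deployment
  minReplicas: 1
  maxReplicas: 8
  targetCPUUtilizationPercentage: 50
# Then trigger a rolling update (for example, change the pod template image)
# and watch both ReplicaSets get scaled toward maxReplicas despite low load.
```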
Anything else we need to know?:
This bug is really critical, and should be addressed and hotfixed as soon as possible.
Environment:
- Kubernetes version (use `kubectl version`):
  Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"windows/amd64"}
  Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.8", GitCommit:"211047e9a1922595eaa3a1127ed365e9299a6c23", GitTreeState:"clean", BuildDate:"2019-10-15T12:02:12Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. `cat /etc/os-release`): Container Linux by CoreOS 2247.5.0 (Rhyolite)
- Kernel (e.g. `uname -a`): 4.19.78-coreos