
Add HPA #305

Merged: 4 commits into master from add-hpa on Aug 7, 2018

Conversation

@dturn (Contributor) commented Jun 27, 2018

Add HPA resource (part of splitting up https://github.com/Shopify/kubernetes-deploy/pull/188/files).

In v2beta1 we get access to three condition statuses: AbleToScale, ScalingActive, and ScalingLimited. I think it makes sense to consider the deploy succeeded when AbleToScale is true.

The first, AbleToScale, indicates whether or not the HPA is able to fetch and update scales, as well as whether or not any backoff-related conditions would prevent scaling.

The downside of using v2beta1 is that there's no support in 1.7. I think it's worth the tradeoff.
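
For illustration, a minimal sketch of the check described above, assuming the synced resource JSON is available on the resource object as @instance_data (the helper names here are illustrative, not necessarily the final implementation):

    # Sketch only: read the v2beta1 status conditions and treat AbleToScale=True as success.
    def deploy_succeeded?
      able_to_scale = conditions.find { |c| c["type"] == "AbleToScale" } || {}
      able_to_scale["status"] == "True"
    end

    def conditions
      @instance_data.dig("status", "conditions") || []
    end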

@karanthukral (Contributor) left a comment:

LGTM

@timothysmith0609 (Contributor) left a comment:

LGTM, I would suggest using the ScalingActive condition to determine deploy success. As the comment around its definition states:

ScalingActive indicates that the HPA controller is able to scale if necessary:
it's correctly configured, can fetch the desired metrics, and isn't disabled.

Though I don't feel particularly strongly one way or the other between this and AbleToScale.

@KnVerey (Contributor) commented Jul 13, 2018

I found the difference between the intent of ScalingActive and AbleToScale a bit confusing too, so I looked up all the possible reasons for them (this file):

AbleToScale

  • FailedGetScale: the HPA controller was unable to get the target's current scale
  • BackoffDownscale: the time since the previous scale is still within the downscale forbidden window
  • BackoffBoth: the time since the previous scale is still within both the downscale and upscale forbidden windows
  • BackoffUpscale: the time since the previous scale is still within the upscale forbidden window
  • ReadyForNewScale: the last scale time was sufficiently old as to warrant a new scale
  • FailedUpdateScale: the HPA controller was unable to update the target scale
  • SucceededRescale: the HPA controller was able to update the target scale to %d

ScalingActive

  • InvalidSelector: the HPA target's scale is missing a selector
  • InvalidSelector: couldn't convert selector into a corresponding internal selector object
  • FailedGetObjectMetric/FailedGetPodsMetric/FailedGetResourceMetric/FailedGetExternalMetric: the HPA was unable to compute the replica count
  • InvalidMetricSourceType: the HPA was unable to compute the replica count
  • ValidMetricFound: the HPA was able to successfully calculate a replica count from %s
  • ScalingDisabled: scaling is disabled since the replica count of the target is zero

Based on that, AbleToScale is talking about ability to scale right now in particular, whereas ScalingActive tells us whether the HPA is correctly configured. So the latter is what we want. We should probably ignore false values with that last reason though, or else deploys will fail when people using HPAs scale to zero manually.
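
A rough sketch of that rule, reusing the kind of conditions helper sketched earlier (illustrative only; the ScalingDisabled carve-out is what keeps a manual scale-to-zero from failing the deploy):

    # Sketch: succeed on ScalingActive=True, but also accept a false value whose
    # reason is ScalingDisabled, i.e. the target's replica count is currently zero.
    def deploy_succeeded?
      scaling_active = conditions.find { |c| c["type"] == "ScalingActive" } || {}
      scaling_active["status"] == "True" || scaling_active["reason"] == "ScalingDisabled"
    end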

# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    PRUNABLE = true
Contributor:

I do like the idea of defining this at the class level, but this doesn't do anything right now. Your test passes because, oddly enough, HPA was already on the whitelist.

end

def deploy_failed?
!exists?
Contributor:

I think this should be checking the ScalingActive condition too (but the deploy should succeed when that's false because of ScalingDisabled).

end

def timeout_message
UNUSUAL_FAILURE_MESSAGE
Contributor:

I'm not sure this message (It is very unusual for this resource type to fail to deploy. Please try the deploy again. If that new deploy also fails, contact your cluster administrator.) is fitting, since HPAs have legitimate failure modes (unlike, say, ConfigMaps).

end

def type
'hpa.v2beta1.autoscaling'
Contributor:

Why this? Don't we want it called HorizontalPodAutoscaler?

Contributor Author:

We do, but this method determines how kubectl fetches the resource. And we don't get the status conditions until v2beta1. Do we want to add a new method?

metadata:
  labels:
    name: web
    app: hello-cloud
Contributor:

nit: these labels should be "hpa" and "hpa-deployment"

@@ -1006,4 +1006,21 @@ def test_raise_on_yaml_missing_kind
" datapoint: value1"
], in_order: true)
end

def test_hpa_can_be_successful
skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
Contributor:

Your PR description says The downside of using v2beta is no support in 1.8.; did you mean 1.7? If that's true, our unofficial deprecation policy (following GKE) would allow us to drop 1.7 at this point, and then the hpa could be folded into hello-cloud, as could the cronjob tests. If that isn't the case (i.e. the conditions aren't present in a version we need to support), don't we need to have alternate success/failure conditions in the code itself rather than here, to avoid breaking deploys for people with hpas on that version?

Contributor Author:

It was added in 1.8 and I've updated the description. What does dropping support for 1.7 mean beyond not running 1.7 CI?

@@ -25,4 +25,8 @@ def apps_v1beta1_kubeclient
def batch_v1beta1_kubeclient
@batch_v1beta1_kubeclient ||= build_batch_v1beta1_kubeclient(MINIKUBE_CONTEXT)
end

def autoscaling_v2beta1_kubeclient
@autoscaling_v2beta1_kubeclient ||= build_autoscaling_v2beta1_kubeclient(MINIKUBE_CONTEXT)
Contributor:

I'm not seeing these clients being used


def test_hpa_can_be_successful
skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
assert_deploy_success(deploy_fixtures("hpa"))
Contributor:

Please try to include at least one relevant logging assertion with your tests. The logs are our UI, so we should always check (PRINT_LOGS=1) what they look like and assert that we haven't broken them. Typically I'll assert on what gets printed about the resource in the summary
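
For example, something along these lines (the asserted strings are illustrative; run with PRINT_LOGS=1 and match whatever the summary actually prints for the HPA):

    def test_hpa_can_be_successful
      skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
      assert_deploy_success(deploy_fixtures("hpa"))
      assert_logs_match_all([
        "Deploying HorizontalPodAutoscaler/hello-hpa", # illustrative log lines
        "Successfully deployed",
      ], in_order: true)
    end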

@dturn (Contributor Author) commented Jul 17, 2018

It looks like I used AbleToScale vs ScalingActive because ScalingActive is false in minikube:

            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2018-07-17T00:15:37Z",
                        "message": "the HPA controller was able to get the target's current scale",
                        "reason": "SucceededGetScale",
                        "status": "True",
                        "type": "AbleToScale"
                    },
                    {
                        "lastTransitionTime": "2018-07-17T00:15:37Z",
                        "message": "the HPA was unable to compute the replica count: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)",
                        "reason": "FailedGetResourceMetric",
                        "status": "False",
                        "type": "ScalingActive"
                    }
                ],
                "currentMetrics": null,
                "currentReplicas": 1,
                "desiredReplicas": 0
            }
        }

@KnVerey (Contributor) left a comment:

> It looks like I used AbleToScale vs ScalingActive because scalingActive is false in minikube

Doesn't that mean that HPAs don't work in minikube? I don't think that's a good reason not to use the correct condition, even if testing will be a problem. Did you dig into the cause of that minikube problem? This issue seems to suggest the problem could be that the resource in question doesn't have req/limits set properly. Could that be the case with the fixture you're using?

> We do, but this method determines how kubectl fetches the resource. And we don't get the status conditions until v2beta1. Do we want to add a new method?

I guess so? Perhaps we should consider pinning our access GVs in general. 🤔 Bottom line for this PR though is I don't think we should print an unfriendly gvk string as part of the name in our output.

> It was added in 1.8 and I've updated the description. What does dropping support for 1.7 mean past not running 1.7 CI?

That's pretty much all it means (well, that and documenting it in our readme, and removing any test workarounds we have in place to support 1.7).

# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    TIMEOUT = 30.seconds
Contributor:

Should we make this a little longer? Seemingly some metrics check is happening, and I have no idea how long that can take.

@dturn (Contributor Author) commented Jul 27, 2018

Turns out the issue was kubernetes/kubernetes#57673, i.e. we need to deploy the metrics server from https://github.com/kubernetes-incubator/metrics-server.

This is ready for 👀 again


private

def deploy_metric_server
Contributor Author:

Is there a helper file where we've been putting methods like this?

Contributor:

Sort of, yes. This is being deployed globally, so I think it should happen during the initial test suite setup, which is taken care of inside test_helper itself. I'd add it to this module:

https://github.com/Shopify/kubernetes-deploy/blob/7b47ed81f040922ad877fc7e3035f6b82a71ea87/test/test_helper.rb#L243-L254

And then call it with the PV setup, i.e. when the file is loaded

https://github.com/Shopify/kubernetes-deploy/blob/7b47ed81f040922ad877fc7e3035f6b82a71ea87/test/test_helper.rb#L289
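
Sketched out, that might look something like the following (the module and method names are hypothetical, and it shells out to kubectl apply directly so the suite setup doesn't depend on the deploy code it's meant to test):

    # Hypothetical setup helper, invoked once when test_helper.rb is loaded.
    # Assumes MINIKUBE_CONTEXT is defined in test_helper and the manifests live in test/setup/metrics-server.
    module MetricsServerSetup
      def self.prepare
        dir = File.expand_path("setup/metrics-server", __dir__)
        output = `kubectl --context=#{MINIKUBE_CONTEXT} apply -f #{dir} 2>&1`
        raise "Failed to set up metrics-server:\n#{output}" unless $?.success?
      end
    end

    MetricsServerSetup.prepare # alongside the PV setup that runs at file load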

@KnVerey (Contributor) left a comment:

Did you have to tweak the metrics server configs at all to get them working, or are they a direct copy-paste from https://github.com/kubernetes-incubator/metrics-server?

@@ -146,6 +146,10 @@ def type
@type || self.class.kind
end

def fetch_type
type
Contributor:

re: the policial check, we can disable Shopify/StrongerParameters for this repo since it doesn't use that gem / have controllers

I found this name confusing though. What we have here is a "group version resource" string that we're underspecifying in all cases except HPA. This string is really kubectl-specific (e.g. if we were using kubeclient, we'd need GV paths), so maybe a name that reflects that would be appropriate, e.g. kubectl_resource_type.
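
Concretely, the shape being suggested is roughly (a sketch; the base class keeps returning the plain kind and only the HPA overrides the kubectl-specific string):

    # In KubernetesResource (base class): default to the plain type/kind.
    def kubectl_resource_type
      type
    end

    # In HorizontalPodAutoscaler: pin the group/version so kubectl surfaces the v2beta1 conditions.
    def kubectl_resource_type
      'hpa.v2beta1.autoscaling'
    end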

def deploy_failed?
  return false unless exists?
  recoverable = RECOVERABLE_CONDITIONS.any? { |c| scaling_active_condition.fetch("reason", "").start_with?(c) }
  able_to_scale_condition["status"] == "False" || (scaling_active_condition["status"] == "False" && !recoverable)
Contributor:

I don't think the "able to scale" condition is relevant at all (same in status). It's talking about whether or not the target in question can be scaled at this particular moment (based on things like whether or not it was just scaled a moment ago), not on the general validity of the autoscaling configuration. I see you have a test that implies that "able to scale - false; FailedGet" means the target resource may not exist, but couldn't failing on that cause a race condition, since the HPA doesn't get deployed after deployments? Couldn't it also mean the request for that resource (transiently) failed, i.e. should be excluded on the same grounds as other FailedGetXs? Is "Scaling active" actually true when the target doesn't exist?

@dturn (Contributor Author) commented Aug 1, 2018:

I actually think FailedGetScale is relevant, since when we're failing with that reason there isn't an 'AbleToScale' condition to look at. In theory it's recoverable, but do we want the deploy to fail or time out?

@@ -114,7 +114,7 @@ def file_path
end

def sync(mediator)
@instance_data = mediator.get_instance(type, name)
@instance_data = mediator.get_instance(fetch_type, name)
Contributor:

The sync mediator's caching should also use the new method, and we should make sure we add a test that would catch that.

https://github.com/Shopify/kubernetes-deploy/blob/7b47ed81f040922ad877fc7e3035f6b82a71ea87/lib/kubernetes-deploy/sync_mediator.rb#L39

if !exists?
super
elsif deploy_succeeded?
"Succeeded"
Contributor:

"Configured"? "Succeeded" sounds like we're claiming it scaled something.

@@ -0,0 +1,24 @@
apiVersion: apps/v1beta1
@KnVerey (Contributor) commented Jul 31, 2018:

Why is this fixture set two layers deep? i.e. why not test/fixtures/hpa/thing.yml?

Edit: I see it's because the top-level dir contains another one with the metrics server config, which I commented on elsewhere

metadata:
  name: hpa-deployment
  annotations:
    shipit.shopify.io/restart: "true"
Contributor:

no need for this annotation


private

def deploy_metric_server

# Set-up the metric server that the HPA needs https://github.com/kubernetes-incubator/metrics-server
ns = @namespace
@namespace = "kube-system"
assert_deploy_success(deploy_fixtures("hpa/kube-system", allow_protected_ns: true, prune: false))
Contributor:

hpa/kube-system isn't really a fixture set; it's more like test infra components. I'd move all those configs somewhere else, like test/setup/metrics-server or something. Maybe we should even use KubeClient or kubectl apply to create it... seems a little weird to have the test setup itself depend on the correct functioning of the core code.

# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    TIMEOUT = 5.minutes
Contributor:

Is this arbitrary or based on testing? 30s seemed short to me, but this seems really long 😄

@dturn (Contributor Author) commented Jul 31, 2018

> Did you have to tweak the metrics server configs at all to get them working, or are they a direct copy-paste from https://github.com/kubernetes-incubator/metrics-server?

Direct copy-paste

# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    TIMEOUT = 3.minutes
Contributor Author:

One test deploy took 120.5s, so this should give us enough buffer.

skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
assert_deploy_failure(deploy_fixtures("hpa", subset: ["hpa.yml"]), :timed_out)
assert_logs_match_all([
"Deploying HorizontalPodAutoscaler/hello-hpa (timeout: 180s)",
Contributor:

This test is guaranteed to take at least 180s 😢
There's some value in asserting that the message displayed on timeout is helpful, but I'm not sure it's worth 3 minutes of CI. Timeouts are fallback behaviour, not something we're generally striving for. I'd be inclined to replace this with a test for a case that actually fails.

Contributor Author:

I'm having a hard time figuring out how to make the HPA pass validation, have AbleToScale be true, and have ScalingActive be false without the cause being in RECOVERABLE_CONDITIONS.

The k8s tests (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/horizontal_test.go) weren't very instructive; any suggestions?

Contributor:

I can't find anything useful either. Kinda weird that so many conditions have been dedicated to situations that are not possible to create from the outside. My final observation is that in playing around with this locally, there seems to be a substantial delay before the conditions get populated, so maybe it would be safe to fail on ScalingActive/FailedGetThingMetric after all, i.e. the benefit of failing fast for those misconfigurations would outweigh the (small) possibility of a race condition causing spurious failures. Without much direct HPA experience, I'm not really sure what is best.

If we can't reproduce a failure scenario, maybe we can set a short timeout on the hpa resource? That won't really work if the condition message we'd look for in the logs can also take 3 minutes to appear though.

Contributor Author:

How would you feel about using Timecop.scale to make this test faster in wall-clock time?

In general I dislike trading correctness for an optimization like failing fast.
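
For reference, roughly what that could look like (assumes the timecop gem is available in the test suite, and that the deploy's timeout is measured with Time, which Timecop stubs, rather than a monotonic clock):

    require 'timecop'

    def test_hpa_deploy_times_out
      skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
      # Make in-process clocks run 60x faster so the 180s resource timeout is hit
      # in a few seconds of wall time; this does not speed up the cluster itself.
      Timecop.scale(60) do
        assert_deploy_failure(deploy_fixtures("hpa", subset: ["hpa.yml"]), :timed_out)
      end
    end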

"Configured"
elsif scaling_active_condition.present? || able_to_scale_condition.present?
condition = scaling_active_condition.presence || able_to_scale_condition
"#{condition['reason']} (#{condition['message']})"
Contributor:

Looking at the test output, this makes for a realllly long status string. Usually those are a word or two. I'd suggest using the reason as the status and moving the message to failure_message/timeout_message when it is present. We're not currently setting those at all, and as a result we're getting the default "Kubernetes will continue to deploy this..." timeout message, which isn't great.
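
In other words, something along these lines (a sketch that reuses the condition helpers already in the diff):

    def status
      if !exists?
        super
      elsif deploy_succeeded?
        "Configured"
      else
        condition = scaling_active_condition.presence || able_to_scale_condition
        condition["reason"].presence || super
      end
    end

    def failure_message
      condition = scaling_active_condition.presence || able_to_scale_condition
      condition["message"] if condition.present?
    end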


class FakeHPA < MockResource
  def kubectl_resource_type
    'fakeHPA'
Contributor:

Isn't this the same as the value of type? If so the test isn't proving anything

Contributor Author:

[1] pry(#)> hpa.kubectl_resource_type
=> "fakeHPA"
[2] pry(#)> hpa.type
=> "FakeHPA"

I'll update to make it more clear.

@dturn force-pushed the add-hpa branch 2 times, most recently from c5c06ef to 4201e28 on August 3, 2018 17:51
@dturn (Contributor Author) commented Aug 3, 2018

1.8 is failing, but not 1.9 or 1.10. Do you think https://github.com/kubernetes-incubator/metrics-server supports 1.7 and >1.8, but not 1.8 itself?

@dturn (Contributor Author) commented Aug 7, 2018

The 1.8 failure appears to be related to minikube defaults that were changed in 1.9.

I was able to get the tests to pass by adding --extra-config=controller-manager.horizontal-pod-autoscaler-use-rest-clients=true to the minikube start command and then running minikube addons enable metrics-server. I'm going to disable the tests for 1.8 rather than mess with this more.

kubernetes/kubernetes#57673
https://stackoverflow.com/questions/48325627/minikube-horizontal-pod-autoscaling-error-unable-to-get-metrics-for-resource-cp

@dturn merged commit 86e5766 into master on Aug 7, 2018
@dturn deleted the add-hpa branch on August 7, 2018 20:51
@dturn mentioned this pull request on Aug 8, 2018