
Add HPA #305

Merged: 4 commits into master from add-hpa on Aug 7, 2018

Conversation

@dturn (Contributor) commented Jun 27, 2018

Add HPA resource (part of splitting up https://github.com/Shopify/kubernetes-deploy/pull/188/files).

In v2beta1 we get access to three condition statuses: AbleToScale, ScalingActive, and ScalingLimited. I think it makes sense to consider the deploy succeeded when AbleToScale is true.

The first, AbleToScale, indicates whether or not the HPA is able to fetch and update scales, as well as whether or not any backoff-related conditions would prevent scaling.

The downside of using v2beta1 is that there's no support in 1.7. I think it's worth the tradeoff.
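
For illustration, a minimal sketch of the check described above, assuming the synced resource JSON is available on the resource object as @instance_data (the helper names here are illustrative, not necessarily the final implementation):

    # Sketch only: read the v2beta1 status conditions and treat AbleToScale=True as success.
    def deploy_succeeded?
      able_to_scale = conditions.find { |c| c["type"] == "AbleToScale" } || {}
      able_to_scale["status"] == "True"
    end

    def conditions
      @instance_data.dig("status", "conditions") || []
    end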

@karanthukral (Contributor) left a comment:

LGTM

@timothysmith0609 (Contributor) left a comment:

LGTM, I would suggest using the ScalingActive condition to determine deploy success. As the comment around its definition states:

ScalingActive indicates that the HPA controller is able to scale if necessary:
it's correctly configured, can fetch the desired metrics, and isn't disabled.

Though I don't feel particularly strongly one way or the other between this and AbleToScale.

@KnVerey (Contributor) commented Jul 13, 2018

I found the difference between the intent of ScalingActive and AbleToScale a bit confusing too, so I looked up all the possible reasons for them (this file):

AbleToScale

  • FailedGetScale: the HPA controller was unable to get the target's current scale
  • BackoffDownscale: the time since the previous scale is still within the downscale forbidden window
  • BackoffBoth: the time since the previous scale is still within both the downscale and upscale forbidden windows
  • BackoffUpscale: the time since the previous scale is still within the upscale forbidden window
  • ReadyForNewScale: the last scale time was sufficiently old as to warrant a new scale
  • FailedUpdateScale: the HPA controller was unable to update the target scale
  • SucceededRescale: the HPA controller was able to update the target scale to %d

ScalingActive

  • InvalidSelector: the HPA target's scale is missing a selector
  • InvalidSelector: couldn't convert selector into a corresponding internal selector object
  • FailedGetObjectMetric/FailedGetPodsMetric/FailedGetResourceMetric/FailedGetExternalMetric: the HPA was unable to compute the replica count
  • InvalidMetricSourceType: the HPA was unable to compute the replica count
  • ValidMetricFound: the HPA was able to successfully calculate a replica count from %s
  • ScalingDisabled: scaling is disabled since the replica count of the target is zero

Based on that, AbleToScale is talking about ability to scale right now in particular, whereas ScalingActive tells us whether the HPA is correctly configured. So the latter is what we want. We should probably ignore false values with that last reason though, or else deploys will fail when people using HPAs scale to zero manually.
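
A rough sketch of that rule, reusing the kind of conditions helper sketched earlier (illustrative only; the ScalingDisabled carve-out is what keeps a manual scale-to-zero from failing the deploy):

    # Sketch: succeed on ScalingActive=True, but also accept a false value whose
    # reason is ScalingDisabled, i.e. the target's replica count is currently zero.
    def deploy_succeeded?
      scaling_active = conditions.find { |c| c["type"] == "ScalingActive" } || {}
      scaling_active["status"] == "True" || scaling_active["reason"] == "ScalingDisabled"
    end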

# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    PRUNABLE = true
Contributor:

I do like the idea of defining this at the class level, but this doesn't do anything right now. Your test passes because, oddly enough, HPA was already on the whitelist.

end

def deploy_failed?
!exists?
Contributor:

I think this should be checking the ScalingActive condition too (but the deploy should succeed when that's false because of ScalingDisabled).

end

def timeout_message
UNUSUAL_FAILURE_MESSAGE
Contributor:

I'm not sure this message (It is very unusual for this resource type to fail to deploy. Please try the deploy again. If that new deploy also fails, contact your cluster administrator.) is fitting, since HPAs have legitimate failure modes (unlike, say, ConfigMaps).

end

def type
'hpa.v2beta1.autoscaling'
Contributor:

Why this? Don't we want it called HorizontalPodAutoscaler?

Contributor Author:

We do, but this method determines how kubectl fetches the resource. And we don't get the status conditions until v2beta1. Do we want to add a new method?

metadata:
  labels:
    name: web
    app: hello-cloud
Contributor:

nit: these labels should be "hpa" and "hpa-deployment"

@@ -1006,4 +1006,21 @@ def test_raise_on_yaml_missing_kind
" datapoint: value1"
], in_order: true)
end

def test_hpa_can_be_successful
skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
Contributor:

Your PR description says The downside of using v2beta is no support in 1.8.; did you mean 1.7? If that's true, our unofficial deprecation policy (following GKE) would allow us to drop 1.7 at this point, and then the hpa could be folded into hello-cloud, as could the cronjob tests. If that isn't the case (i.e. the conditions aren't present in a version we need to support), don't we need to have alternate success/failure conditions in the code itself rather than here, to avoid breaking deploys for people with hpas on that version?

Contributor Author:

It was added in 1.8 and I've updated the description. What does dropping support for 1.7 mean beyond not running 1.7 CI?

@@ -25,4 +25,8 @@ def apps_v1beta1_kubeclient
def batch_v1beta1_kubeclient
@batch_v1beta1_kubeclient ||= build_batch_v1beta1_kubeclient(MINIKUBE_CONTEXT)
end

def autoscaling_v2beta1_kubeclient
@autoscaling_v2beta1_kubeclient ||= build_autoscaling_v2beta1_kubeclient(MINIKUBE_CONTEXT)
Contributor:

I'm not seeing these clients being used


def test_hpa_can_be_successful
skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
assert_deploy_success(deploy_fixtures("hpa"))
Contributor:

Please try to include at least one relevant logging assertion with your tests. The logs are our UI, so we should always check (PRINT_LOGS=1) what they look like and assert that we haven't broken them. Typically I'll assert on what gets printed about the resource in the summary
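
For example, something along these lines (the asserted strings are illustrative; run with PRINT_LOGS=1 and match whatever the summary actually prints for the HPA):

    def test_hpa_can_be_successful
      skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
      assert_deploy_success(deploy_fixtures("hpa"))
      assert_logs_match_all([
        "Deploying HorizontalPodAutoscaler/hello-hpa", # illustrative log lines
        "Successfully deployed",
      ], in_order: true)
    end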

@dturn (Contributor Author) commented Jul 17, 2018

It looks like I used AbleToScale vs ScalingActive because ScalingActive is false in minikube:

            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2018-07-17T00:15:37Z",
                        "message": "the HPA controller was able to get the target's current scale",
                        "reason": "SucceededGetScale",
                        "status": "True",
                        "type": "AbleToScale"
                    },
                    {
                        "lastTransitionTime": "2018-07-17T00:15:37Z",
                        "message": "the HPA was unable to compute the replica count: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)",
                        "reason": "FailedGetResourceMetric",
                        "status": "False",
                        "type": "ScalingActive"
                    }
                ],
                "currentMetrics": null,
                "currentReplicas": 1,
                "desiredReplicas": 0
            }
        }

@KnVerey (Contributor) left a comment:

> It looks like I used AbleToScale vs ScalingActive because scalingActive is false in minikube

Doesn't that mean that HPAs don't work in minikube? I don't think that's a good reason not to use the correct condition, even if testing will be a problem. Did you dig into the cause of that minikube problem? This issue seems to suggest the problem could be that the resource in question doesn't have req/limits set properly. Could that be the case with the fixture you're using?

> We do, but this method determines how kubectl fetches the resource. And we don't get the status conditions until v2beta1. Do we want to add a new method?

I guess so? Perhaps we should consider pinning our access GVs in general. 🤔 Bottom line for this PR though is I don't think we should print an unfriendly gvk string as part of the name in our output.

> It was added in 1.8 and I've updated the description. What does dropping support for 1.7 mean past not running 1.7 CI?

That's pretty much all it means (well, that and documenting it in our readme, and removing any test workarounds we have in place to support 1.7).

# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    TIMEOUT = 30.seconds
Contributor:

Should we make this a little longer? Seemingly some metrics check is happening, and I have no idea how long that can take.

@dturn (Contributor Author) commented Jul 27, 2018

Turns out the issue was kubernetes/kubernetes#57673, i.e. we need to deploy the metrics server from https://github.com/kubernetes-incubator/metrics-server.

This is ready for 👀 again


private

def deploy_metric_server
Contributor Author:

Is there a helper file where we've been putting methods like this?

Contributor:

Sort of, yes. This is being deployed globally, so I think it should happen during the initial test suite setup, which is taken care of inside test_helper itself. I'd add it to this module:

https://github.com/Shopify/kubernetes-deploy/blob/7b47ed81f040922ad877fc7e3035f6b82a71ea87/test/test_helper.rb#L243-L254

And then call it with the PV setup, i.e. when the file is loaded

https://github.com/Shopify/kubernetes-deploy/blob/7b47ed81f040922ad877fc7e3035f6b82a71ea87/test/test_helper.rb#L289
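
Sketched out, that might look something like the following (the module and method names are hypothetical, and it shells out to kubectl apply directly so the suite setup doesn't depend on the deploy code it's meant to test):

    # Hypothetical setup helper, invoked once when test_helper.rb is loaded.
    # Assumes MINIKUBE_CONTEXT is defined in test_helper and the manifests live in test/setup/metrics-server.
    module MetricsServerSetup
      def self.prepare
        dir = File.expand_path("setup/metrics-server", __dir__)
        output = `kubectl --context=#{MINIKUBE_CONTEXT} apply -f #{dir} 2>&1`
        raise "Failed to set up metrics-server:\n#{output}" unless $?.success?
      end
    end

    MetricsServerSetup.prepare # alongside the PV setup that runs at file load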

@KnVerey (Contributor) left a comment:

Did you have to tweak the metrics server configs at all to get them working, or are they a direct copy-paste from https://github.com/kubernetes-incubator/metrics-server?

@@ -146,6 +146,10 @@ def type
@type || self.class.kind
end

def fetch_type
type
Contributor:

re: the policial check, we can disable Shopify/StrongerParameters for this repo since it doesn't use that gem / have controllers

I found this name confusing though. What we have here is a "group version resource" string that we're underspecifying in all cases except HPA. This string is really kubectl-specific (e.g. if we were using kubeclient, we'd need GV paths), so maybe a name that reflects that would be appropriate, e.g. kubectl_resource_type.
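
Concretely, the shape being suggested is roughly (a sketch; the base class keeps returning the plain kind and only the HPA overrides the kubectl-specific string):

    # In KubernetesResource (base class): default to the plain type/kind.
    def kubectl_resource_type
      type
    end

    # In HorizontalPodAutoscaler: pin the group/version so kubectl surfaces the v2beta1 conditions.
    def kubectl_resource_type
      'hpa.v2beta1.autoscaling'
    end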

def deploy_failed?
  return false unless exists?
  recoverable = RECOVERABLE_CONDITIONS.any? { |c| scaling_active_condition.fetch("reason", "").start_with?(c) }
  able_to_scale_condition["status"] == "False" || (scaling_active_condition["status"] == "False" && !recoverable)
Contributor:

I don't think the "able to scale" condition is relevant at all (same in status). It's talking about whether or not the target in question can be scaled at this particular moment (based on things like whether or not it was just scaled a moment ago), not on the general validity of the autoscaling configuration. I see you have a test that implies that "able to scale - false; FailedGet" means the target resource may not exist, but couldn't failing on that cause a race condition, since the HPA doesn't get deployed after deployments? Couldn't it also mean the request for that resource (transiently) failed, i.e. should be excluded on the same grounds as other FailedGetXs? Is "Scaling active" actually true when the target doesn't exist?

@dturn (Contributor Author) commented Aug 1, 2018:

I actually think FailedGetScale is relevant, since when we're failing with that reason there isn't an 'AbleToScale' condition to look at. In theory it's recoverable, but do we want the deploy to fail or time out?

@@ -114,7 +114,7 @@ def file_path
end

def sync(mediator)
@instance_data = mediator.get_instance(type, name)
@instance_data = mediator.get_instance(fetch_type, name)
Contributor:

The sync mediator's caching should also use the new method, and we should make sure we add a test that would catch that.

https://github.com/Shopify/kubernetes-deploy/blob/7b47ed81f040922ad877fc7e3035f6b82a71ea87/lib/kubernetes-deploy/sync_mediator.rb#L39

if !exists?
super
elsif deploy_succeeded?
"Succeeded"
Contributor:

"Configured"? "Succeeded" sounds like we're claiming it scaled something.

@@ -0,0 +1,24 @@
apiVersion: apps/v1beta1
@KnVerey (Contributor) commented Jul 31, 2018:

Why is this fixture set two layers deep? i.e. why not test/fixtures/hpa/thing.yml?

Edit: I see it's because the top-level dir contains another one with the metrics server config, which I commented on elsewhere

metadata:
  name: hpa-deployment
  annotations:
    shipit.shopify.io/restart: "true"
Contributor:

no need for this annotation


private

def deploy_metric_server

# Set-up the metric server that the HPA needs https://github.com/kubernetes-incubator/metrics-server
ns = @namespace
@namespace = "kube-system"
assert_deploy_success(deploy_fixtures("hpa/kube-system", allow_protected_ns: true, prune: false))
Contributor:

hpa/kube-system isn't really a fixture set; it's more like test infra components. I'd move all those configs somewhere else, like test/setup/metrics-server or something. Maybe we should even use KubeClient or kubectl apply to create it... seems a little weird to have the test setup itself depend on the correct functioning of the core code.

# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    TIMEOUT = 5.minutes
Contributor:

Is this arbitrary or based on testing? 30s seemed short to me, but this seems really long 😄

@dturn (Contributor Author) commented Jul 31, 2018

> Did you have to tweak the metrics server configs at all to get them working, or are they a direct copy-paste from https://github.com/kubernetes-incubator/metrics-server?

Direct copy-paste

# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    TIMEOUT = 3.minutes
Contributor Author:

One test deploy took 120.5s, so this should give us enough buffer.

skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
assert_deploy_failure(deploy_fixtures("hpa", subset: ["hpa.yml"]), :timed_out)
assert_logs_match_all([
"Deploying HorizontalPodAutoscaler/hello-hpa (timeout: 180s)",
Contributor:

This test is guaranteed to take at least 180s 😢
There's some value in asserting that the message displayed on timeout is helpful, but I'm not sure it's worth 3 minutes of CI. Timeouts are fallback behaviour, not something we're generally striving for. I'd be inclined to replace this with a test for a case that actually fails.

Contributor Author:

I'm having a hard time figuring out how to make the HPA pass validation, have AbleToScale be true, and have ScalingActive be false without the cause being in RECOVERABLE_CONDITIONS.

The k8s tests (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/horizontal_test.go) weren't very instructive; any suggestions?

Contributor:

I can't find anything useful either. Kinda weird that so many conditions have been dedicated to situations that are not possible to create from the outside. My final observation is that in playing around with this locally, there seems to be a substantial delay before the conditions get populated, so maybe it would be safe to fail on ScalingActive/FailedGetThingMetric after all, i.e. the benefit of failing fast for those misconfigurations would outweigh the (small) possibility of a race condition causing spurious failures. Without much direct HPA experience, I'm not really sure what is best.

If we can't reproduce a failure scenario, maybe we can set a short timeout on the hpa resource? That won't really work if the condition message we'd look for in the logs can also take 3 minutes to appear though.

Contributor Author:

How would you feel about using Timecop.scale to make this test faster in wall-clock time?

In general I dislike trading correctness for an optimization like failing fast.
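
For reference, roughly what that could look like (assumes the timecop gem is available in the test suite, and that the deploy's timeout is measured with Time, which Timecop stubs, rather than a monotonic clock):

    require 'timecop'

    def test_hpa_deploy_times_out
      skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
      # Make in-process clocks run 60x faster so the 180s resource timeout is hit
      # in a few seconds of wall time; this does not speed up the cluster itself.
      Timecop.scale(60) do
        assert_deploy_failure(deploy_fixtures("hpa", subset: ["hpa.yml"]), :timed_out)
      end
    end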

"Configured"
elsif scaling_active_condition.present? || able_to_scale_condition.present?
condition = scaling_active_condition.presence || able_to_scale_condition
"#{condition['reason']} (#{condition['message']})"
Contributor:

Looking at the test output, this makes for a realllly long status string. Usually those are a word or two. I'd suggest using the reason as the status and moving the message to failure_message/timeout_message when it is present. We're not currently setting those at all, and as a result we're getting the default "Kubernetes will continue to deploy this..." timeout message, which isn't great.
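
In other words, something along these lines (a sketch that reuses the condition helpers already in the diff):

    def status
      if !exists?
        super
      elsif deploy_succeeded?
        "Configured"
      else
        condition = scaling_active_condition.presence || able_to_scale_condition
        condition["reason"].presence || super
      end
    end

    def failure_message
      condition = scaling_active_condition.presence || able_to_scale_condition
      condition["message"] if condition.present?
    end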


class FakeHPA < MockResource
  def kubectl_resource_type
    'fakeHPA'
Contributor:

Isn't this the same as the value of type? If so the test isn't proving anything

Contributor Author:

[1] pry(#)> hpa.kubectl_resource_type
=> "fakeHPA"
[2] pry(#)> hpa.type
=> "FakeHPA"

I'll update to make it more clear.

@dturn force-pushed the add-hpa branch 2 times, most recently from c5c06ef to 4201e28 on August 3, 2018 17:51
@dturn (Contributor Author) commented Aug 3, 2018

1.8 is failing, but not 1.9 or 1.10. Do you think https://github.com/kubernetes-incubator/metrics-server supports 1.7 and >1.8, but not 1.8 itself?

@dturn (Contributor Author) commented Aug 7, 2018

The 1.8 failure appears to be related to minikube defaults that were changed in 1.9.

I was able to get the tests to pass by adding --extra-config=controller-manager.horizontal-pod-autoscaler-use-rest-clients=true to the minikube start command and then running minikube addons enable metrics-server. I'm going to disable the tests for 1.8 rather than mess with this more.

kubernetes/kubernetes#57673
https://stackoverflow.com/questions/48325627/minikube-horizontal-pod-autoscaling-error-unable-to-get-metrics-for-resource-cp

@dturn merged commit 86e5766 into master on Aug 7, 2018
@dturn deleted the add-hpa branch on August 7, 2018 20:51
@dturn mentioned this pull request on Aug 8, 2018