
Adjust Discovery throttling #3368

Merged
4 commits merged into kptdev:main on Aug 12, 2022

Conversation

jashandeep-sohi
Contributor

I'm using kpt live apply for a bunch of CRDs. However, as the number of CRDs increases, I'm finding it nearly impossible to use because I'm getting throttled after every CRD is applied.

It turns out the client runs a discovery pass over every CRD present on the api-server after each new CRD is applied. This is not a problem exclusive to kpt, and there are issues in the kubectl project to address it: kubernetes/kubectl#1126 kubernetes/kubernetes#105520 kubernetes/kubernetes#107131

Disabling it completely is probably not a good idea, but I wanted to get the ball rolling on addressing this issue and adjusting these parameters.

@google-cla

google-cla bot commented Jul 14, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@droot
Contributor

droot commented Jul 15, 2022

Thanks @jashandeep-sohi for the PR.

@karlkfi Can you please take a look at this?

@karlkfi
Contributor

karlkfi commented Jul 16, 2022

🤔

This should probably just be part of UpdateQPS, so that if server-side throttling is enabled the client-side discovery throttling is disabled.

I had assumed that disabling the QPS would also disable the discovery QPS, but I think the discovery QPS has its own separate default, and that’s why you’re seeing a delay.

How many CRDs are you applying in one package tho?

We might need to build a smarter discovery mapper to really solve this problem. Right now the cache invalidation is all or nothing and preemptively loads all the CRDs after each reset.
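
A minimal sketch of the approach described here, assuming API Priority and Fairness (APF) marks responses to /livez/ping with the X-Kubernetes-PF-FlowSchema-UID header and that negative QPS/Burst values are treated by client-go as "no client-side limit". The helper names are hypothetical, not kpt's actual UpdateQPS:

// Illustrative only: probe whether server-side throttling (APF) is enabled,
// and only then turn off client-side rate limiting.
package throttling

import (
	"fmt"
	"net/http"
	"strings"

	"k8s.io/client-go/rest"
)

// serverSideThrottlingEnabled pings /livez/ping and reports whether the
// response carries an APF flow-schema header, i.e. whether the server is
// doing the throttling for us.
func serverSideThrottlingEnabled(config *rest.Config) (bool, error) {
	httpClient, err := rest.HTTPClientFor(config)
	if err != nil {
		return false, err
	}
	url := strings.TrimSuffix(config.Host, "/") + "/livez/ping"
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	resp, err := httpClient.Do(req)
	if err != nil {
		return false, fmt.Errorf("making /livez/ping request: %w", err)
	}
	defer resp.Body.Close()
	return resp.Header.Get("X-Kubernetes-PF-FlowSchema-UID") != "", nil
}

// relaxClientSideLimits disables client-go's token-bucket rate limiter only
// when the server throttles for us. Note that cli-runtime's discovery client
// applies its own separate burst/QPS defaults, which is the gap this PR is
// about.
func relaxClientSideLimits(config *rest.Config) {
	if enabled, err := serverSideThrottlingEnabled(config); err == nil && enabled {
		config.QPS = -1
		config.Burst = -1
	}
}

The effect is that client-side limits stay in place on clusters where APF is off, which matches the concern about older clusters raised later in this thread.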

@jashandeep-sohi
Contributor Author

🤔

This should probably just be part of UpdateQPS, so that if server-side throttling is enabled the client-side discovery throttling is disabled.

I had assumed that disabling the QPS would also disable the discovery QPS, but I think the discovery QPS has its own separate default, and that’s why you’re seeing a delay.

How many CRDs are you applying in one package tho?

We might need to build a smarter discovery mapper to really solve this problem. Right now the cache invalidation is all or nothing and preemptively loads all the CRDs after each reset.

It's not even that many CRDs at the moment. It's about 26 right now (all the cluster-api ones), but that's certainly expected to increase as we start using more crossplane ones.

I followed your lead, and my understanding is that it's not primarily the number of CRDs you're applying that matters, but rather the total number of different API Groups already installed on the cluster.

Perhaps the real fix is better cache invalidation. Wouldn't it be enough to just invalidate a single API Group, i.e. the one that's being applied? Maybe I should also open up an issue upstream, but that seems like a much bigger change that'll take some time to make its way through.

Regardless, I think adjusting these parameters (or providing a way to customize them at runtime) would be helpful.
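
As a rough illustration of the per-group idea (purely a sketch, not the upstream DeferredDiscoveryRESTMapper and not the code in any fork; the type and method names are invented), a mapper could cache discovery results per API group and refresh only the group of the CRD that was just applied:

// Illustrative only: a RESTMapper-style cache that can invalidate one API
// group at a time, rather than resetting and re-listing every group.
package restmappercache

import (
	"sync"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/restmapper"
)

type GroupCachedMapper struct {
	mu     sync.Mutex
	dc     discovery.DiscoveryInterface
	groups map[string]*restmapper.APIGroupResources
	mapper meta.RESTMapper
}

func NewGroupCachedMapper(dc discovery.DiscoveryInterface) (*GroupCachedMapper, error) {
	grs, err := restmapper.GetAPIGroupResources(dc)
	if err != nil {
		return nil, err
	}
	m := &GroupCachedMapper{dc: dc, groups: map[string]*restmapper.APIGroupResources{}}
	for _, gr := range grs {
		m.groups[gr.Group.Name] = gr
	}
	m.rebuild()
	return m, nil
}

// InvalidateGroup re-discovers a single API group (e.g. the group of the CRD
// that was just applied) and leaves every other group's cached data alone.
func (m *GroupCachedMapper) InvalidateGroup(group string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	apiGroups, err := m.dc.ServerGroups()
	if err != nil {
		return err
	}
	for _, g := range apiGroups.Groups {
		if g.Name != group {
			continue
		}
		gr := &restmapper.APIGroupResources{
			Group:              g,
			VersionedResources: map[string][]metav1.APIResource{},
		}
		for _, v := range g.Versions {
			rl, err := m.dc.ServerResourcesForGroupVersion(v.GroupVersion)
			if err != nil {
				return err
			}
			gr.VersionedResources[v.Version] = rl.APIResources
		}
		m.groups[group] = gr
		m.rebuild()
		return nil
	}
	delete(m.groups, group) // the group is no longer served
	m.rebuild()
	return nil
}

func (m *GroupCachedMapper) rebuild() {
	grs := make([]*restmapper.APIGroupResources, 0, len(m.groups))
	for _, gr := range m.groups {
		grs = append(grs, gr)
	}
	m.mapper = restmapper.NewDiscoveryRESTMapper(grs)
}

// RESTMapping shows only one method of meta.RESTMapper; a caller could
// InvalidateGroup(gk.Group) on a "no matches for kind" error and retry.
func (m *GroupCachedMapper) RESTMapping(gk schema.GroupKind, versions ...string) (*meta.RESTMapping, error) {
	m.mu.Lock()
	mapper := m.mapper
	m.mu.Unlock()
	return mapper.RESTMapping(gk, versions...)
}

A caller could invalidate just the group of the CRD it applied and retry the failed lookup, instead of re-listing every group on the cluster.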

@karlkfi
Contributor

karlkfi commented Jul 16, 2022

Yeah, the upstream mapper is the root problem here. It doesn’t provide an API for invalidating a single API group, and it’s too aggressive about pre-loading because it’s designed to assume that a cache miss means it’s not registered on the server. There have been a number of proposals to fix this but none of them have been implemented yet.

On the easier side, we could expose the QPS config to the user. That feels like lazy UX, hoisting a solvable problem onto the user to configure. But maybe that's preferable to not having a good solution in the short term. I know some Config Sync users who also want to be able to decrease the QPS to reduce server load without needing to configure QoS on every cluster.

@justinsb
Contributor

I think this is a good idea; client-side throttling is just not the right way for the apiserver to protect against overload, and we have priority and fairness in newer kubernetes versions - I believe all OSS-supported kubernetes versions.

In kubebuilder-declarative-pattern we are experimenting with an alternative RESTMapper that doesn't fetch things we don't need; I'd also be in favor of reusing that: https://github.com/kubernetes-sigs/kubebuilder-declarative-pattern/tree/master/pkg/restmapper

As you point out @jashandeep-sohi, it's hard to get this accepted upstream without demonstrating it first, so the strategy we seem to be converging on is copy-and-pasting the code between the various projects and letting them evolve independently (with occasional sync-ups), and then we can try to get this upstreamed into client-go or wherever it belongs best.

Thanks for the contribution @jashandeep-sohi :-) One problem is that it looks like you haven't signed the project's CLA ( https://github.com/GoogleContainerTools/kpt/pull/3368/checks?check_run_id=7372699029 ) - are you able to sign it?

@justinsb
Contributor

justinsb left a review comment

lgtm, but we can't merge without the CLA

@karlkfi
Contributor

karlkfi commented Jul 16, 2022

Actually I don’t think we want to disable QPS here. If you move it down into UpdateQPS tho that’d be ok. Then it only disables on newer clusters with P&F enabled.

@karlkfi
Contributor

karlkfi commented Jul 16, 2022

we have priority and fairness in newer kubernetes versions - I believe all OSS-supported kubernetes versions.

We still support v1.21 where P&F is alpha. GKE stable uses it still. That’s why UpdateQPS does the check to see if it’s enabled before disabling QPS client-side.

@karlkfi
Contributor

karlkfi commented Jul 16, 2022

Oh, also, I’ve found the P&F docs to be inscrutable. Nobody knows how to configure it and there aren’t very many examples available. I’ve been meaning to interrogate the api machinery team and get some recommendations published but haven’t gotten around to it yet.

@jashandeep-sohi
Contributor Author

@justinsb

Right, forgot about the CLA. Should be signed now.

And thanks for the pointers about getting things merged upstream. I also took a quick crack at improving the cache invalidation in my forks.

It's certainly faster in my not-so-real-world test:

$ kind create cluster --name kpt-test
$ for i in {01..99}; do cat <<EOF; done > crds.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: tests.group${i}.example.com
spec:
  group: group${i}.example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
  scope: Namespaced
  names:
    plural: tests
    singular: tests
    kind: Test
---
EOF

# this also has discovery throttling disabled in addition to the above patches
$ curl -L 'https://github.com/jashandeep-sohi/kpt/releases/download/v1.0.0-beta.17-discovery-cache/kpt_linux_amd64' -o kpt-dc
$ chmod +x kpt-dc

$ kpt-dc live init
$ kpt-dc live install-resource-group

$ time kpt-dc live apply
$ time kpt-dc live destroy

# counting the number of requests per API endpoint
$ kpt-dc live apply -v 6 2>&1 | grep -i 'http' | cut -d ' ' -f 6-9 | sort | uniq -c
$ kpt-dc live destroy -v 6 2>&1 | grep -i 'http' | cut -d ' ' -f 6-9 | sort | uniq -c

@jashandeep-sohi
Contributor Author

Actually I don’t think we want to disable QPS here. If you move it down into UpdateQPS tho that’d be ok. Then it only disables on newer clusters with P&F enabled.

I tried to do that as well, but as far as I can tell https://pkg.go.dev/k8s.io/client-go@v0.24.3/rest#Config doesn't expose anything related to discovery. It's being handled here https://github.com/kubernetes/cli-runtime/blob/4fdf49ae46a0caa7fafdfe97825c6129d5153f06/pkg/genericclioptions/config_flags.go#L269-L289 (which is also where I presume the values set by the UpdateQPS callback are overwritten).

@droot
Contributor

droot commented Jul 25, 2022

Ping @karlkfi @justinsb: are we ready to merge this?

@karlkfi
Contributor

karlkfi commented Jul 25, 2022

No. It's not safe to disable discoveryQPS and discoveryBurst unless we know server-side throttling is enabled. That's what the code in UpdateQPS is doing.

Unfortunately, it means the lookup logic needs to be pulled out and done earlier, so it can influence QPS in both use cases, because ConfigFlags.toDiscoveryClient() overrides the values in QPS and Burst with what's in discoveryQPS and discoveryBurst.

@karlkfi
Contributor

karlkfi left a review comment

Please modify UpdateQPS to take *ConfigFlags and set WrapConfigFn, DiscoveryBurst, and DiscoveryQPS after looking up whether server-side throttling is enabled.
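
Roughly, the requested shape might look like the sketch below, continuing the hypothetical throttling package from the earlier sketch. The real kpt function is UpdateQPS; updateQPSEager, the serverSideThrottlingEnabled probe, and the availability of the WithDiscoveryBurst/WithDiscoveryQPS setters on the vendored cli-runtime ConfigFlags are all assumptions:

// Second file in the same hypothetical package as the /livez/ping sketch.
package throttling

import (
	"k8s.io/cli-runtime/pkg/genericclioptions"
	"k8s.io/client-go/rest"
)

// updateQPSEager queries the server first, then configures both the REST
// config wrapper and the discovery limits on the shared *ConfigFlags.
func updateQPSEager(flags *genericclioptions.ConfigFlags) error {
	config, err := flags.ToRESTConfig()
	if err != nil {
		return err
	}
	enabled, err := serverSideThrottlingEnabled(config) // probe from the earlier sketch
	if err != nil || !enabled {
		return err
	}
	// Disable client-side throttling for regular requests...
	flags.WrapConfigFn = func(c *rest.Config) *rest.Config {
		c.QPS = -1
		c.Burst = -1
		return c
	}
	// ...and for discovery, which cli-runtime configures separately from its
	// own discoveryBurst/discoveryQPS fields (setter availability assumed).
	flags.WithDiscoveryBurst(-1).WithDiscoveryQPS(-1)
	return nil
}

The catch with this eager form is that it has to contact the apiserver as soon as the flags are wired up, which comes up below.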

@karlkfi
Contributor

karlkfi commented Aug 6, 2022

Looks like this is now causing test failures in TestFnRender and TestFnEval.

Lots of logs like:

W0805 22:39:22.871596   75576 factory.go:71] Failed to query apiserver to check for flow control enablement: making /livez/ping request: dial tcp [::1]:8080: connect: connection refused

It's possible this is now trying to query the server before the connection is configured. I guess because WrapConfigFn used to delay the call until later in the process.

I wonder if WrapConfigFn could be injected with a pointer to the ConfigFlags and modify them when run, instead of just modifying the rest.Config. Would that work, or is that too late to modify the flags struct?

@jashandeep-sohi
Contributor Author

OK, I wrapped everything in a closure and that fixed the tests. And I don't think it's too late, because WrapConfigFn is called (in f.ToRESTConfig) before Burst and QPS are set.

https://github.com/kubernetes/cli-runtime/blob/4fdf49ae46a0caa7fafdfe97825c6129d5153f06/pkg/genericclioptions/config_flags.go#L269-L289
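
The closure shape described here might look roughly like the following, again continuing the hypothetical throttling package and the same assumptions as before (this is a sketch, not the merged code):

// Lazy variant: defer the server probe until WrapConfigFn runs inside
// ToRESTConfig, so nothing dials the apiserver while the flags are being
// constructed (the eager probe is what produced the /livez/ping
// connection-refused warnings in the hermetic tests above).
func updateQPSLazy(flags *genericclioptions.ConfigFlags) {
	flags.WrapConfigFn = func(c *rest.Config) *rest.Config {
		enabled, err := serverSideThrottlingEnabled(c) // probe from the earlier sketch
		if err != nil || !enabled {
			return c
		}
		c.QPS = -1
		c.Burst = -1
		// Mutating the flags here still affects discovery: per the cli-runtime
		// code linked above, ToDiscoveryClient calls ToRESTConfig (running this
		// closure) before it copies discoveryBurst/discoveryQPS into the config.
		flags.WithDiscoveryBurst(-1).WithDiscoveryQPS(-1)
		return c
	}
}

Deferring the probe into WrapConfigFn means the apiserver is only contacted once a REST config is actually materialized, which lines up with the closure change fixing the hermetic tests.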

@karlkfi
Contributor

karlkfi left a review comment

/lgtm
/approve

Thanks! This seems like a great improvement.

@karlkfi
Contributor

karlkfi commented Aug 12, 2022

We'll probably want to update kpt-config-sync too. And we JUST migrated that project's source of truth to GitHub yesterday, so we can accept PRs now. Is that something you're interested in doing, now that we know the change to make? I think the code is very similar there, if it's not already using this function (which would mean needing a kpt dep bump after the next kpt release).

@karlkfi karlkfi merged commit 0dc6037 into kptdev:main Aug 12, 2022
chunglu-chou pushed a commit to chunglu-chou/kpt that referenced this pull request Aug 20, 2022
* Disable client-side throttling for resource discovery if server-side throttling is enabled