
Adjust Discovery throttling #3368

Merged
4 commits merged into kptdev:main on Aug 12, 2022

Conversation

jashandeep-sohi
Contributor

I'm using kpt live apply for a bunch of CRDs. However, as the number of CRDs increases, I'm finding it nearly impossible to use because I'm getting throttled after every CRD is applied.

It turns out the client runs a discovery pass over every CRD present on the api-server after each new CRD is applied. This is not a problem exclusive to kpt, and there are issues in the kubectl project to address it: kubernetes/kubectl#1126 kubernetes/kubernetes#105520 kubernetes/kubernetes#107131

Disabling it completely is probably not a good idea, but I wanted to get the ball rolling on addressing this issue and adjusting these parameters.

@google-cla

google-cla bot commented Jul 14, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@droot
Contributor

droot commented Jul 15, 2022

Thanks @jashandeep-sohi for the PR.

@karlkfi Can you please take a look at this?

@karlkfi
Contributor

karlkfi commented Jul 16, 2022

🤔

This should probably just be part of UpdateQPS, so that if server-side throttling is enabled the client-side discovery throttling is disabled.

I had assumed that disabling the QPS would also disable the discovery QPS, but I think the discovery QPS has its own separate default, and that’s why you’re seeing a delay.

How many CRDs are you applying in one package tho?

We might need to build a smarter discovery mapper to really solve this problem. Right now the cache invalidation is all or nothing and preemptively loads all the CRDs after each reset.
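
A minimal sketch of the approach described here, assuming API Priority and Fairness (APF) marks responses to /livez/ping with the X-Kubernetes-PF-FlowSchema-UID header and that negative QPS/Burst values are treated by client-go as "no client-side limit". The helper names are hypothetical, not kpt's actual UpdateQPS:

// Illustrative only: probe whether server-side throttling (APF) is enabled,
// and only then turn off client-side rate limiting.
package throttling

import (
	"fmt"
	"net/http"
	"strings"

	"k8s.io/client-go/rest"
)

// serverSideThrottlingEnabled pings /livez/ping and reports whether the
// response carries an APF flow-schema header, i.e. whether the server is
// doing the throttling for us.
func serverSideThrottlingEnabled(config *rest.Config) (bool, error) {
	httpClient, err := rest.HTTPClientFor(config)
	if err != nil {
		return false, err
	}
	url := strings.TrimSuffix(config.Host, "/") + "/livez/ping"
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	resp, err := httpClient.Do(req)
	if err != nil {
		return false, fmt.Errorf("making /livez/ping request: %w", err)
	}
	defer resp.Body.Close()
	return resp.Header.Get("X-Kubernetes-PF-FlowSchema-UID") != "", nil
}

// relaxClientSideLimits disables client-go's token-bucket rate limiter only
// when the server throttles for us. Note that cli-runtime's discovery client
// applies its own separate burst/QPS defaults, which is the gap this PR is
// about.
func relaxClientSideLimits(config *rest.Config) {
	if enabled, err := serverSideThrottlingEnabled(config); err == nil && enabled {
		config.QPS = -1
		config.Burst = -1
	}
}

The effect is that client-side limits stay in place on clusters where APF is off, which matches the concern about older clusters raised later in this thread.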

@jashandeep-sohi
Contributor Author

🤔

This should probably just be part of UpdateQPS, so that if server-side throttling is enabled the client-side discovery throttling is disabled.

I had assumed that disabling the QPS would also disable the discovery QPS, but I think the discovery QPS has its own separate default, and that’s why you’re seeing a delay.

How many CRDs are you applying in one package tho?

We might need to build a smarter discovery mapper to really solve this problem. Right now the cache invalidation is all or nothing and preemptively loads all the CRDs after each reset.

It's not even that many CRDs at the moment. It's about 26 right now (all the cluster-api ones), but that's certainly expected to increase as we start using more crossplane ones.

I followed your lead, and my understanding is that it's not primarily the number of CRDs you're applying that matters, but rather the total number of different API Groups already installed on the cluster.

Perhaps the real fix is better cache invalidation. Wouldn't it be enough to just invalidate a single API Group, i.e. the one that's being applied? Maybe I should also open up an issue upstream, but that seems like a much bigger change that'll take some time to make its way through.

Regardless, I think adjusting these parameters (or providing a way to customize them at runtime) would be helpful.
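
As a rough illustration of the per-group idea (purely a sketch, not the upstream DeferredDiscoveryRESTMapper and not the code in any fork; the type and method names are invented), a mapper could cache discovery results per API group and refresh only the group of the CRD that was just applied:

// Illustrative only: a RESTMapper-style cache that can invalidate one API
// group at a time, rather than resetting and re-listing every group.
package restmappercache

import (
	"sync"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/restmapper"
)

type GroupCachedMapper struct {
	mu     sync.Mutex
	dc     discovery.DiscoveryInterface
	groups map[string]*restmapper.APIGroupResources
	mapper meta.RESTMapper
}

func NewGroupCachedMapper(dc discovery.DiscoveryInterface) (*GroupCachedMapper, error) {
	grs, err := restmapper.GetAPIGroupResources(dc)
	if err != nil {
		return nil, err
	}
	m := &GroupCachedMapper{dc: dc, groups: map[string]*restmapper.APIGroupResources{}}
	for _, gr := range grs {
		m.groups[gr.Group.Name] = gr
	}
	m.rebuild()
	return m, nil
}

// InvalidateGroup re-discovers a single API group (e.g. the group of the CRD
// that was just applied) and leaves every other group's cached data alone.
func (m *GroupCachedMapper) InvalidateGroup(group string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	apiGroups, err := m.dc.ServerGroups()
	if err != nil {
		return err
	}
	for _, g := range apiGroups.Groups {
		if g.Name != group {
			continue
		}
		gr := &restmapper.APIGroupResources{
			Group:              g,
			VersionedResources: map[string][]metav1.APIResource{},
		}
		for _, v := range g.Versions {
			rl, err := m.dc.ServerResourcesForGroupVersion(v.GroupVersion)
			if err != nil {
				return err
			}
			gr.VersionedResources[v.Version] = rl.APIResources
		}
		m.groups[group] = gr
		m.rebuild()
		return nil
	}
	delete(m.groups, group) // the group is no longer served
	m.rebuild()
	return nil
}

func (m *GroupCachedMapper) rebuild() {
	grs := make([]*restmapper.APIGroupResources, 0, len(m.groups))
	for _, gr := range m.groups {
		grs = append(grs, gr)
	}
	m.mapper = restmapper.NewDiscoveryRESTMapper(grs)
}

// RESTMapping shows only one method of meta.RESTMapper; a caller could
// InvalidateGroup(gk.Group) on a "no matches for kind" error and retry.
func (m *GroupCachedMapper) RESTMapping(gk schema.GroupKind, versions ...string) (*meta.RESTMapping, error) {
	m.mu.Lock()
	mapper := m.mapper
	m.mu.Unlock()
	return mapper.RESTMapping(gk, versions...)
}

A caller could invalidate just the group of the CRD it applied and retry the failed lookup, instead of re-listing every group on the cluster.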

@karlkfi
Contributor

karlkfi commented Jul 16, 2022

Yeah, the upstream mapper is the root problem here. It doesn’t provide an API for invalidating a single API group, and it’s too aggressive about pre-loading because it’s designed to assume that a cache miss means it’s not registered on the server. There have been a number of proposals to fix this but none of them have been implemented yet.

On the easier side, we could expose the QPS config to the user. That feels like lazy UX, hoisting a solvable problem onto the user to configure. But maybe that's preferable to not having a good solution in the short term. I know some Config Sync users who also want to be able to decrease the QPS to reduce server load without needing to configure QoS on every cluster.

@justinsb
Contributor

I think this is a good idea; client-side throttling is just not the right way for the apiserver to protect against overload, and we have priority and fairness in newer kubernetes versions - I believe all OSS-supported kubernetes versions.

In kubebuilder-declarative-pattern we are experimenting with an alternative RESTMapper that doesn't fetch things we don't need; I'd also be in favor of reusing that: https://github.com/kubernetes-sigs/kubebuilder-declarative-pattern/tree/master/pkg/restmapper

As you point out @jashandeep-sohi, it's hard to get this accepted upstream without demonstrating it first, so the strategy we seem to be converging on is copy-and-pasting the code between the various projects and letting them evolve independently (with occasional sync-ups), and then we can try to get this upstreamed into client-go or wherever it belongs best.

Thanks for the contribution @jashandeep-sohi :-) One problem is that it looks like you haven't signed the project's CLA ( https://github.com/GoogleContainerTools/kpt/pull/3368/checks?check_run_id=7372699029 ) - are you able to sign it?

@justinsb
Contributor

justinsb left a review comment

lgtm, but we can't merge without the CLA

@karlkfi
Contributor

karlkfi commented Jul 16, 2022

Actually I don’t think we want to disable QPS here. If you move it down into UpdateQPS tho that’d be ok. Then it only disables on newer clusters with P&F enabled.

@karlkfi
Contributor

karlkfi commented Jul 16, 2022

we have priority and fairness in newer kubernetes versions - I believe all OSS-supported kubernetes versions.

We still support v1.21 where P&F is alpha. GKE stable uses it still. That’s why UpdateQPS does the check to see if it’s enabled before disabling QPS client-side.

@karlkfi
Contributor

karlkfi commented Jul 16, 2022

Oh, also, I’ve found the P&F docs to be inscrutable. Nobody knows how to configure it and there aren’t very many examples available. I’ve been meaning to interrogate the api machinery team and get some recommendations published but haven’t gotten around to it yet.

@jashandeep-sohi
Contributor Author

@justinsb

Right, forgot about the CLA. Should be signed now.

And thanks for the pointers about getting things merged upstream. I also took a quick crack at improving the cache invalidation in my forks.

It's certainly faster in my not-so-real-world test:

$ kind create cluster --name kpt-test
$ for i in {01..99}; do cat <<EOF; done > crds.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: tests.group${i}.example.com
spec:
  group: group${i}.example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
  scope: Namespaced
  names:
    plural: tests
    singular: tests
    kind: Test
---
EOF

# this also has discovery throttling disabled in addition to the above patches
$ curl -L 'https://github.com/jashandeep-sohi/kpt/releases/download/v1.0.0-beta.17-discovery-cache/kpt_linux_amd64' -o kpt-dc
$ chmod +x kpt-dc

$ kpt-dc live init
$ kpt-dc live install-resource-group

$ time kpt-dc live apply
$ time kpt-dc live destroy

# counting the number of requests per API endpoint
$ kpt-dc live apply -v 6 2>&1 | grep -i 'http' | cut -d ' ' -f 6-9 | sort | uniq -c
$ kpt-dc live destroy -v 6 2>&1 | grep -i 'http' | cut -d ' ' -f 6-9 | sort | uniq -c

@jashandeep-sohi
Contributor Author

Actually I don’t think we want to disable QPS here. If you move it down into UpdateQPS tho that’d be ok. Then it only disables on newer clusters with P&F enabled.

I tried to do that as well, but as far as I can tell https://pkg.go.dev/k8s.io/client-go@v0.24.3/rest#Config doesn't expose anything related to discovery. It's being handled here https://github.com/kubernetes/cli-runtime/blob/4fdf49ae46a0caa7fafdfe97825c6129d5153f06/pkg/genericclioptions/config_flags.go#L269-L289 (which is also where I presume the values set by the UpdateQPS callback are overwritten).

@droot
Contributor

droot commented Jul 25, 2022

Ping @karlkfi @justinsb: are we ready to merge this?

@karlkfi
Contributor

karlkfi commented Jul 25, 2022

No. It's not safe to disable discoveryQPS and discoveryBurst unless we know server-side throttling is enabled. That's what the code in UpdateQPS is doing.

Unfortunately, it means the lookup logic needs to be pulled out and done earlier, so it can influence QPS in both use cases, because ConfigFlags.toDiscoveryClient() overrides the values in QPS and Burst with what's in discoveryQPS and discoveryBurst.

@karlkfi
Contributor

karlkfi left a review comment

Please modify UpdateQPS to take *ConfigFlags and set WrapConfigFn, DiscoveryBurst, and DiscoveryQPS after looking up whether server-side throttling is enabled.
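
Roughly, the requested shape might look like the sketch below, continuing the hypothetical throttling package from the earlier sketch. The real kpt function is UpdateQPS; updateQPSEager, the serverSideThrottlingEnabled probe, and the availability of the WithDiscoveryBurst/WithDiscoveryQPS setters on the vendored cli-runtime ConfigFlags are all assumptions:

// Second file in the same hypothetical package as the /livez/ping sketch.
package throttling

import (
	"k8s.io/cli-runtime/pkg/genericclioptions"
	"k8s.io/client-go/rest"
)

// updateQPSEager queries the server first, then configures both the REST
// config wrapper and the discovery limits on the shared *ConfigFlags.
func updateQPSEager(flags *genericclioptions.ConfigFlags) error {
	config, err := flags.ToRESTConfig()
	if err != nil {
		return err
	}
	enabled, err := serverSideThrottlingEnabled(config) // probe from the earlier sketch
	if err != nil || !enabled {
		return err
	}
	// Disable client-side throttling for regular requests...
	flags.WrapConfigFn = func(c *rest.Config) *rest.Config {
		c.QPS = -1
		c.Burst = -1
		return c
	}
	// ...and for discovery, which cli-runtime configures separately from its
	// own discoveryBurst/discoveryQPS fields (setter availability assumed).
	flags.WithDiscoveryBurst(-1).WithDiscoveryQPS(-1)
	return nil
}

The catch with this eager form is that it has to contact the apiserver as soon as the flags are wired up, which comes up below.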

@karlkfi
Contributor

karlkfi commented Aug 6, 2022

Looks like this is now causing test failures in TestFnRender and TestFnEval.

Lots of logs like:

W0805 22:39:22.871596   75576 factory.go:71] Failed to query apiserver to check for flow control enablement: making /livez/ping request: dial tcp [::1]:8080: connect: connection refused

It's possible this is now trying to query the server before the connection is configured. I guess because WrapConfigFn used to delay the call until later in the process.

I wonder if WrapConfigFn could be injected with a pointer to the ConfigFlags and modify them when run, instead of just modifying the rest.Config. Would that work, or is that too late to modify the flags struct?

@jashandeep-sohi
Contributor Author

OK, I wrapped everything in a closure and that fixed the tests. And I don't think it's too late, because WrapConfigFn is called (in f.ToRESTConfig) before Burst and QPS are set.

https://github.com/kubernetes/cli-runtime/blob/4fdf49ae46a0caa7fafdfe97825c6129d5153f06/pkg/genericclioptions/config_flags.go#L269-L289
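
The closure shape described here might look roughly like the following, again continuing the hypothetical throttling package and the same assumptions as before (this is a sketch, not the merged code):

// Lazy variant: defer the server probe until WrapConfigFn runs inside
// ToRESTConfig, so nothing dials the apiserver while the flags are being
// constructed (the eager probe is what produced the /livez/ping
// connection-refused warnings in the hermetic tests above).
func updateQPSLazy(flags *genericclioptions.ConfigFlags) {
	flags.WrapConfigFn = func(c *rest.Config) *rest.Config {
		enabled, err := serverSideThrottlingEnabled(c) // probe from the earlier sketch
		if err != nil || !enabled {
			return c
		}
		c.QPS = -1
		c.Burst = -1
		// Mutating the flags here still affects discovery: per the cli-runtime
		// code linked above, ToDiscoveryClient calls ToRESTConfig (running this
		// closure) before it copies discoveryBurst/discoveryQPS into the config.
		flags.WithDiscoveryBurst(-1).WithDiscoveryQPS(-1)
		return c
	}
}

Deferring the probe into WrapConfigFn means the apiserver is only contacted once a REST config is actually materialized, which lines up with the closure change fixing the hermetic tests.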

@karlkfi
Contributor

karlkfi left a review comment

/lgtm
/approve

Thanks! This seems like a great improvement.

@karlkfi
Contributor

karlkfi commented Aug 12, 2022

We'll probably want to update kpt-config-sync too. And we JUST migrated that project's source of truth to GitHub yesterday, so we can accept PRs now. Is that something you're interested in doing, now that we know the change to make? I think the code is very similar there, if it's not already using this function (which would mean needing a kpt dep bump after the next kpt release).

@karlkfi karlkfi merged commit 0dc6037 into kptdev:main Aug 12, 2022
chunglu-chou pushed a commit to chunglu-chou/kpt that referenced this pull request Aug 20, 2022
* Disable client-side throttling for resource discovery if server-side throttling is enabled