
add workerCount config and multiple worker supports for istio_gateway and istio_virtualservice #5473


Open
wants to merge 4 commits into base: master

Conversation

sichenzhao

What does it do?

#5458

Motivation

#5458

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly


linux-foundation-easycla bot commented May 27, 2025

CLA Signed


The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mloiseleur for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from szuecs May 27, 2025 04:07
@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 27, 2025
@k8s-ci-robot
Contributor

Welcome @sichenzhao!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 27, 2025
@k8s-ci-robot
Contributor

Hi @sichenzhao. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 27, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 27, 2025
@ivankatliarchuk
Contributor

I didn’t spend too much time on this, so my analysis could be incorrect, but I wanted to provide an overview of why the current implementation is slow and why the proposed solution may not necessarily provide performance benefits.


Thank you for taking the time to improve the code. It's never easy—especially when the original version was written to simply work, without performance in mind.


General feedback

While I agree that performance improvements are needed, the current approach feels somewhat naive; it's more of a brute-force solution: add more CPU to gain an improvement.

Instead, I’d encourage starting with:

  • A performance audit
  • A clear refactoring strategy

The guiding principles should be:

  • Fast
  • Simple
  • Proven — with benchmarking tests showing actual performance gains

My perspective

You may not fully agree with my view, but I’m approaching this from a different angle. I see several concerns with the current solution:

  • ❌ No clear evidence that adding workers improves performance
  • ❌ Code complexity has roughly doubled, increasing maintenance burden
  • ❌ The cyclomatic and runtime complexity spans multiple layers and remains unresolved
  • ⚠️ There is great potential for optimization, especially if caching or short-circuiting can reduce inner loops.
  • ⚠️ Jumping straight into multithreading does not necessarily provide a performance benefit, while adding other costs.

Regarding the Endpoints method

The time complexity of the Endpoints method is tied to nested loops and the operations inside them. Here’s the breakdown:

  1. Listing Gateways:
    • sc.istioClient.NetworkingV1alpha3().Gateways(sc.namespace).List(...) retrieves all gateways.
    • Assume this operation is O(n), where n is the number of gateways.
  2. Filtering by Annotations:
    • sc.filterByAnnotations(gateways) iterates over all gateways.
    • Time complexity: O(n).
  3. Processing Gateways:
    • The outer loop iterates over the filtered gateways (O(n)).
    • For each gateway:
      • sc.hostNamesFromGateway(gateway) processes gateway.Spec.Servers and their Hosts. Assume there are m servers and h hosts per server; time complexity: O(m * h).
      • sc.endpointsFromGateway(...) processes hostnames (O(k), where k is the number of hostnames). Inside this, sc.targetsFromGateway(...) may involve additional loops over services or ingress objects.
  4. Sorting Endpoints:
    • The final loop sorts the Targets for each endpoint.
    • Assume there are e endpoints, and sorting each takes O(t log t), where t is the number of targets per endpoint.
    • Total complexity: O(e * t log t).

Overall Time Complexity:
Combining these, the total time complexity is approximately:

O(n) + O(n) + O(n * (m * h + k)) + O(e * t log t)

Where:

  • n = number of gateways.
  • m = number of servers per gateway.
  • h = number of hosts per server.
  • k = number of hostnames per gateway.
  • e = number of endpoints.
  • t = number of targets per endpoint.

So I've created a visual for myself to understand it better

graph TD
    A[Endpoints Method] --> B[Loop over gateways]
    B --> C[Filter by annotations]
    C --> D[Loop over filtered gateways]
    D --> E[Extract hostnames from gateway]
    E --> F[Loop over servers in gateway]
    F --> G[Loop over hosts in server]
    D --> H[Generate endpoints from gateway]
    H --> I[Loop over hostnames]
    I --> J[Extract targets from gateway]
    J --> K[Loop over ingress load balancer]
    J --> L[Loop over services]
    A --> M[Loop over endpoints to sort targets]


Main chain of nested logic:

  • Loop over gateways (top-level Gateway objects): O(g)
  • Filter by annotations (might loop over annotations per gateway): O(a) per gateway
  • Loop over filtered gateways: up to O(g) again
  • Loop over servers in gateway: O(s) per gateway
  • Loop over hosts in server: O(h) per server
  • Loop over hostnames (hostname annotation and match rule hosts): O(h')
  • Loop over ingress load balancer entries: O(lb) per target
  • Loop over services (possibly EndpointSlice or similar): O(svc)

So the potential bottlenecks are:

  • FilterByAnnotations(): full scan of annotations, maybe unnecessary if rarely used, or could be improved with a different data structure
  • ForEach Gateway → Server → Host: deep nesting, repeated string processing
  • GetTargetsFromGateway(): repeated LB & service resolution
  • Sort Targets: happens for every endpoint, even though sorting is not always required

I'm not sure that wrapping all the code in goroutines is the right approach at the moment.

We all see it differently, but these are potential improvements (see the sketch after this list):

  1. Reduce nesting in for loops; instead of nesting, unwrap them.
  2. Avoid redundant filtering. Move annotation-based filtering before iterating servers and hosts. If a gateway is irrelevant, skip it early.
  3. Extract & cache common hostnames. Use a helper to extract hostnames once per Gateway, rather than repeating it in multiple subcalls.
  4. Improve functions like TargetsFromTargetAnnotation and similar methods that have for loops.
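
For illustration, a minimal Go sketch of points 2 and 3 (skip early, extract hostnames once per gateway). The types and the matchesAnnotationFilter helper are hypothetical stand-ins, not the actual external-dns source:

```go
package main

import "fmt"

// Minimal stand-in types; the real code works with Istio Gateway objects.
type server struct{ Hosts []string }
type gateway struct {
	Namespace, Name string
	Annotations     map[string]string
	Servers         []server
}

// matchesAnnotationFilter is a hypothetical early-skip predicate.
func matchesAnnotationFilter(annotations map[string]string) bool {
	_, ok := annotations["external-dns.alpha.kubernetes.io/hostname"]
	return ok
}

// cacheHostnames filters gateways before touching servers/hosts, and
// extracts hostnames exactly once per gateway so later subcalls can
// reuse the cache instead of re-walking Servers and Hosts.
func cacheHostnames(gateways []gateway) map[string][]string {
	byGateway := make(map[string][]string, len(gateways))
	for _, gw := range gateways {
		if !matchesAnnotationFilter(gw.Annotations) {
			continue // irrelevant gateway: skipped early
		}
		var hosts []string
		for _, s := range gw.Servers {
			hosts = append(hosts, s.Hosts...)
		}
		byGateway[gw.Namespace+"/"+gw.Name] = hosts
	}
	return byGateway
}

func main() {
	gws := []gateway{{
		Namespace:   "default",
		Name:        "gw",
		Annotations: map[string]string{"external-dns.alpha.kubernetes.io/hostname": "a.example.com"},
		Servers:     []server{{Hosts: []string{"a.example.com"}}},
	}}
	fmt.Println(cacheHostnames(gws)) // map[default/gw:[a.example.com]]
}
```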

There are quite a few I/O operations that may take a while to respond; every call to the Kubernetes API adds time to the overall performance. Slicing them up and wrapping them in goroutines could help, but it may also kill/throttle the Kubernetes API server on the other side. So there are pros/cons to optimising them as well (see the rate-limiting sketch after this list):

  • sc.istioClient.NetworkingV1alpha3().Gateways(sc.namespace).List(ctx, metav1.ListOptions{})
  • sc.kubeClient.NetworkingV1().Ingresses(namespace).Get(ctx, name, metav1.GetOptions{})
  • svcInformer.Lister().Services(namespace).List(labels.Everything())
  • .... probably more ....
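
One way to fan work out without flooding the API server is client-go's built-in client-side rate limiting. A minimal sketch, assuming a kubeconfig path; the QPS/Burst values are illustrative:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Path is illustrative; in-cluster code would use rest.InClusterConfig().
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	// Client-side rate limiting: cap the sustained request rate and burst
	// size so concurrent workers cannot overwhelm the API server.
	cfg.QPS = 20
	cfg.Burst = 40

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("client ready (QPS=%v, Burst=%v): %T\n", cfg.QPS, cfg.Burst, client)
}
```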

@szuecs
Contributor

szuecs commented May 30, 2025

just FYI #5458 (comment)

@incfly

incfly commented Jun 4, 2025

@szuecs Thanks, replied to it here: #5458 (comment)

Hi @ivankatliarchuk I don't think the issue we are facing is about algorithm-level complexity. If there are N VirtualService/Gateway resources, we have to process all of them, and it wouldn't be that slow if it were pure computation.
The actual slowness comes from processing the virtual services sequentially, where each one does both computation and I/O (an API server request like here: https://github.com/kubernetes-sigs/external-dns/blob/master/source/istio_virtualservice.go#L246). That's why adding workers definitely helps. We could share numbers from a prototype if necessary. A bounded worker pool along these lines is sketched below.
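
(A minimal sketch of the bounded-worker idea, not the PR's actual diff; processVirtualService and the limit of 4 stand in for the real per-resource work and the proposed workerCount config:)

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

// processVirtualService stands in for the real per-resource work, which
// mixes computation with API-server I/O.
func processVirtualService(ctx context.Context, name string) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(50 * time.Millisecond): // simulate the I/O-bound part
	}
	fmt.Println("processed", name)
	return nil
}

func main() {
	virtualServices := []string{"vs-1", "vs-2", "vs-3", "vs-4"}

	g, ctx := errgroup.WithContext(context.Background())
	g.SetLimit(4) // workerCount: bound concurrency instead of one goroutine per resource

	for _, vs := range virtualServices {
		vs := vs // capture loop variable (pre-Go 1.22 semantics)
		g.Go(func() error {
			return processVirtualService(ctx, vs)
		})
	}
	if err := g.Wait(); err != nil {
		fmt.Println("error:", err)
	}
}
```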

Avoid Redundant Filtering. Move annotation-based filtering before iterating servers and hosts. If a gateway is irrelevant, skip it early.

Unfortunately this won't help us, because we need to process all Istio resources in the cluster.

The alternative would be to deploy multiple external-dns instances and let each of them work on a subset of Istio resources. We decided that's a sub-optimal approach because it would require dev teams to coordinate external-dns deployments with Istio CR generation, potentially even across different teams. Therefore we think it's better to make an OSS PR change so that others in the community who manage a large number of Istio resources with external-dns can also benefit.

Hope this helps!

@mloiseleur
Collaborator

API server request like ...

🤔 With an API server request for each VirtualService, parallelizing R requests across G gateways may flood the API server with G*R parallel requests, no?

Wdyt about reducing the number of required API server requests by treating them in batch instead of one by one? Something like the sketch below.
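
(A minimal sketch of the batching idea: one List per namespace cached into a map, instead of a Get per VirtualService. The fake clientset is used only so the example runs without a cluster:)

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func main() {
	// Fake clientset so the sketch runs without a live cluster.
	client := fake.NewSimpleClientset()
	ctx := context.Background()

	// Batch: a single List per namespace instead of one Get per VirtualService.
	ingresses, err := client.NetworkingV1().Ingresses("default").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Index by name for O(1) in-memory lookups while processing resources.
	byName := make(map[string]struct{}, len(ingresses.Items))
	for _, ing := range ingresses.Items {
		byName[ing.Name] = struct{}{}
	}
	fmt.Println("cached ingresses:", len(byName))
}
```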

@ivankatliarchuk
Contributor

ivankatliarchuk commented Jun 16, 2025

By any chance, could you first make some simple improvements?

Step 1.

Step 2.
Worth trying to implement Indexers. Example commit with suggestion #5493 (comment) and implementation #5493 (comment). A minimal sketch follows below.

And make sure to understand that API calls are not only calls made to the AWS API or any other provider; internal Kubernetes calls are API calls too.
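
(A minimal sketch of the indexer idea using client-go's cache.Indexers; the index name and the annotation-keyed index function are illustrative:)

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, 0)
	svcInformer := factory.Core().V1().Services().Informer()

	// Custom index: O(1) lookup of services by annotation value instead of
	// listing and scanning everything on each reconciliation.
	err := svcInformer.AddIndexers(cache.Indexers{
		"byHostnameAnnotation": func(obj interface{}) ([]string, error) {
			svc := obj.(*corev1.Service)
			if h, ok := svc.Annotations["external-dns.alpha.kubernetes.io/hostname"]; ok {
				return []string{h}, nil
			}
			return nil, nil
		},
	})
	if err != nil {
		panic(err)
	}

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	svcs, _ := svcInformer.GetIndexer().ByIndex("byHostnameAnnotation", "a.example.com")
	fmt.Println("matches:", len(svcs))
}
```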

Step 3

Step 4

  • We will consider this PR with workers. At the moment it is just a brute-force solution, where the bottleneck has not been identified.

@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 18, 2025