Acceleration update Endpoints #11046

Open
zvlb wants to merge 1 commit into main

@zvlb zvlb commented Feb 29, 2024

What this PR does / why we need it:

When a K8s cluster has many Ingresses and frequent deploys, the controller runs into several problems.

Why:

  • The SyncIngresses function takes about one and a half minutes to complete. The slowest steps are nginx -t and nginx -s reload, each of which takes about 40 seconds in my case (~3000 Ingresses, 200 MB nginx config file).
  • Because Ingress resources change while SyncIngresses is running, the next SyncIngresses starts immediately after the previous one finishes, and so on in a perpetual loop.

My problem:
Updating Endpoints is part of the SyncIngresses function. This means NGINX only picks up new Endpoints 1 to 3 minutes after a Pod's IP actually changes.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • CVE Report (Scanner found CVE and adding report)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation only

Which issue/s this PR fixes

I added a new queue that handles only Endpoints updates, and copied the logic from SyncIngresses that updates the dynamic (Lua) configuration.
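To illustrate the idea of a dedicated Endpoints-only queue, here is a minimal, self-contained sketch. It is not the code from this PR: `endpointQueue`, `Add`, and `Drain` are hypothetical names, and the sketch only shows how a separate queue can coalesce repeated changes to the same Service so the fast path does not wait on the slow full sync.

```go
package main

import (
	"fmt"
	"sync"
)

// endpointQueue is a hypothetical stand-in for a dedicated Endpoints-only
// work queue. It coalesces repeated keys, so a burst of changes to the same
// Service results in a single pending item.
type endpointQueue struct {
	mu      sync.Mutex
	pending map[string]struct{}
	order   []string
}

func newEndpointQueue() *endpointQueue {
	return &endpointQueue{pending: make(map[string]struct{})}
}

// Add enqueues a key unless it is already pending.
func (q *endpointQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, ok := q.pending[key]; ok {
		return
	}
	q.pending[key] = struct{}{}
	q.order = append(q.order, key)
}

// Drain returns all pending keys in insertion order and empties the queue.
func (q *endpointQueue) Drain() []string {
	q.mu.Lock()
	defer q.mu.Unlock()
	keys := q.order
	q.order = nil
	q.pending = make(map[string]struct{})
	return keys
}

func main() {
	q := newEndpointQueue()
	q.Add("default/web")
	q.Add("default/api")
	q.Add("default/web") // duplicate, coalesced with the first entry
	fmt.Println(q.Drain()) // prints [default/web default/api]
}
```

A worker that drains this queue would only need to push the new endpoint lists to the dynamic (Lua) configuration, without running nginx -t or nginx -s reload.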

I see a potential data-consistency issue when SyncIngresses and SyncEndpoints run in parallel, at the point where n.runningConfig = pcfg is assigned. SyncIngresses also updates Endpoints, and because it takes so long to run, the endpoint list it writes will likely be outdated (by about one and a half minutes).

(However, I think we can fix it)
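The consistency issue described above can be reproduced in a small, deterministic sketch. This is a simplified model, not the controller's real types: `config`, `snapshot`, and `simulateStaleCommit` are hypothetical names, and the "slow sync" is simulated by ordering the steps by hand instead of running goroutines.

```go
package main

import "fmt"

// config is a hypothetical, simplified stand-in for the running
// configuration: backend name -> endpoint addresses.
type config map[string][]string

// snapshot deep-copies the config, as a long full sync effectively does when
// it builds its new configuration from the state it saw at start time.
func snapshot(c config) config {
	out := make(config, len(c))
	for k, v := range c {
		out[k] = append([]string(nil), v...)
	}
	return out
}

// simulateStaleCommit replays the problematic ordering step by step.
func simulateStaleCommit() config {
	running := config{"web": {"10.0.0.1"}}

	// 1. The slow full sync starts and builds pcfg from current state.
	pcfg := snapshot(running)

	// 2. While it runs, the fast endpoints path applies a newer Pod IP.
	running["web"] = []string{"10.0.0.2"}

	// 3. The full sync finishes and commits its now-stale view.
	running = pcfg
	return running
}

func main() {
	fmt.Println(simulateStaleCommit()["web"]) // prints [10.0.0.1]
}
```

The newer address 10.0.0.2 is lost at step 3, which is the outdated endpoint list the paragraph above warns about.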

How Has This Been Tested?

I ran my fork in a K8s cluster with ~3000 Ingresses and very active developers (a deploy every minute).

Checklist:

(I skipped adding documentation and tests because I first want to discuss my solution. It may not be a very good one.)

Signed-off-by: zvlb <vl.zemtsov@gmail.com>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 29, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zvlb
Once this PR has been reviewed and has the lgtm label, please assign puerco for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 29, 2024
@k8s-ci-robot
Contributor

Hi @zvlb. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


netlify bot commented Feb 29, 2024

Deploy Preview for kubernetes-ingress-nginx canceled.

Name Link
🔨 Latest commit c632426
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-ingress-nginx/deploys/65e0bf92c7b51b0008e0fe2a

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 29, 2024
@zvlb zvlb mentioned this pull request Feb 29, 2024
5 tasks
@strongjz strongjz requested review from rikatz and removed request for cpanato February 29, 2024 20:30
@strongjz
Member

@zvlb Looks like the nginx config generation e2e tests are complaining about this change.

@strongjz
Member

/kind feature
/priority backlog
/triage accepted

Others have complained about this issue in the past: with a large number of Ingresses, syncing takes a while. I believe this will help.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 19, 2024
@strongjz
Member

I think we need to discuss this more with the acceleration of endpoint updates and serial reloads.

#10884

This may lead to some unforeseen issues.

cc @rikatz @tao12345666333

@zvlb
Author

zvlb commented Mar 19, 2024

@strongjz My changes have a big logic problem.
I described it in Slack.

When SyncIngresses finishes, it deletes all endpoints that were created since it started. That is a big problem.

I can add some locks to fix it.
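One possible shape for such a lock, as a rough sketch only: the `controller` type and the `syncEndpoints` and `commitFullSync` methods below are hypothetical and much simpler than the real controller. The assumption is that the fast path records the latest endpoints under a mutex, and the full sync re-reads them under the same mutex just before committing.

```go
package main

import (
	"fmt"
	"sync"
)

// controller is a hypothetical, simplified model of the idea; it is not the
// real controller type from ingress-nginx.
type controller struct {
	mu        sync.Mutex
	endpoints map[string][]string // latest endpoints seen by the fast path
	running   map[string][]string // last committed configuration
}

func newController() *controller {
	return &controller{
		endpoints: make(map[string][]string),
		running:   make(map[string][]string),
	}
}

// syncEndpoints is the fast path: it records and applies new endpoints.
func (c *controller) syncEndpoints(backend string, addrs []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.endpoints[backend] = addrs
	c.running[backend] = addrs
}

// commitFullSync is called at the end of the slow full sync. Under the lock,
// it replaces possibly stale endpoints in pcfg with the latest known ones
// before committing, so fast-path updates are not lost.
func (c *controller) commitFullSync(pcfg map[string][]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for backend := range pcfg {
		if latest, ok := c.endpoints[backend]; ok {
			pcfg[backend] = latest
		}
	}
	c.running = pcfg
}

func main() {
	c := newController()
	pcfg := map[string][]string{"web": {"10.0.0.1"}} // built when the sync started
	c.syncEndpoints("web", []string{"10.0.0.2"})     // newer update arrives meanwhile
	c.commitFullSync(pcfg)
	fmt.Println(c.running["web"]) // prints [10.0.0.2]
}
```

Whether a single mutex is enough, or whether removed backends and ordering between the two queues need more care, is exactly what would have to be discussed before relying on this approach.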


github-actions bot commented May 4, 2024

This is stale, but we won't close it automatically. Just bear in mind that the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any question or request to prioritize this, please reach #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label May 4, 2024
@k8s-triage-robot

The lifecycle/frozen label can not be applied to PRs.

This bot removes lifecycle/frozen from PRs because:

  • Commenting /lifecycle frozen on a PR has not worked since March 2021
  • PRs that remain open for >150 days are unlikely to be easily rebased

You can:

  • Rebase this PR and attempt to get it merged
  • Close this PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label May 4, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

4 participants