perf(prometheus): fix upstream health expensive iterate latency spike issue #10949
Merged
Conversation
oowl force-pushed the fix/prometheus-exporter branch 2 times, most recently from 701c1e3 to 91d3091 on May 26, 2023 at 08:33
oowl changed the title from “fix(prometheus): reduce upstream helath expensive iterate interval” to “fix(prometheus): reduce upstream helath expensive iterate collect in every scrape” on May 26, 2023
oowl force-pushed the fix/prometheus-exporter branch from 91d3091 to 7b0a6e3 on May 29, 2023 at 07:19
bungle changed the title from “fix(prometheus): reduce upstream helath expensive iterate collect in every scrape” to “fix(prometheus): reduce upstream health expensive iterate collect in every scrape” on May 30, 2023
oowl force-pushed the fix/prometheus-exporter branch 3 times, most recently from d062459 to 891d263 on June 5, 2023 at 02:18
Please rebase the PR as the master is reset.
oowl force-pushed the fix/prometheus-exporter branch from 891d263 to 2eff7e9 on June 5, 2023 at 06:27
dndx reviewed on Jun 5, 2023
oowl force-pushed the fix/prometheus-exporter branch from d347df0 to 777ee2e on June 5, 2023 at 07:51
I have tested it. We can see that this PR brings a practical performance improvement.
oowl changed the title from “fix(prometheus): reduce upstream health expensive iterate collect in every scrape” to “perf(prometheus): fix upstream health expensive iterate latency spike issue” on Jun 5, 2023
oowl force-pushed the fix/prometheus-exporter branch from 48f8b9a to be8c6ba on June 5, 2023 at 14:07
chronolaw reviewed on Jun 6, 2023
oowl force-pushed the fix/prometheus-exporter branch 2 times, most recently from 318eb27 to 20a7f5c on June 6, 2023 at 06:39
chronolaw reviewed on Jun 6, 2023
oowl force-pushed the fix/prometheus-exporter branch from 20a7f5c to 7f69406 on June 7, 2023 at 02:50
oowl force-pushed the fix/prometheus-exporter branch from f27f0bb to 9d38d1b on June 7, 2023 at 02:58
chronolaw approved these changes on Jun 7, 2023
oowl force-pushed the fix/prometheus-exporter branch 2 times, most recently from 80afa81 to 841886c on June 7, 2023 at 10:42
dndx approved these changes on Jun 8, 2023
oowl force-pushed the fix/prometheus-exporter branch from 841886c to b1be008 on June 8, 2023 at 08:36
oowl force-pushed the fix/prometheus-exporter branch from b1be008 to d3e6d55 on June 8, 2023 at 08:36
Summary
Based on #10749 (comment).
In my testing, the most CPU-intensive task for the Prometheus plugin is the part of iterating through all upstream health statuses. This part blocks for around 400ms on every metrics request (with 10k upstreams and 20k targets), which is unacceptable for us.
I also found the most time-consuming points in this function: excessive creation of temporary tables, which leads to heavy GC pressure, as well as some performance losses caused by `table.insert` and `string.gsub` being NYI in LuaJIT. Therefore, the solution to this problem is to:

- yield inside the long loop over the upstream iteration;
- optimize the `full_metrics_name` function and reduce `gsub` function calls.

In my opinion, this iteration doesn't really need to be done this way: traversing all upstream health statuses on every scrape is clearly not a good solution. The use of yield here is merely a workaround, as the required CPU time doesn't disappear; it is simply preempted so that it doesn't affect the proxy's latency too much.
Checklist
Issue reference
KAG-632