perf(prometheus): fix upstream health expensive iterate latency spike issue #10949

Merged
merged 8 commits into from Jun 8, 2023

@oowl (Member) commented May 26, 2023

Summary

Based on #10749 (comment).
In my testing, the most CPU-intensive task in the Prometheus plugin is iterating through all upstream health statuses. This step blocks for around 400ms on every metrics request (with 10k upstreams and 20k targets), which is unacceptable for us.
The most time-consuming points in this function are the excessive creation of temporary tables, which causes heavy GC pressure, and the use of table.insert and string.gsub, which lose performance because of LuaJIT NYI (not yet implemented) operations. The solution is therefore to:

  1. Yield inside the long loop of the upstream iteration.
  2. Fix the NYI in the full_metrics_name function and reduce the number of gsub calls.
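
The second point can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the PR's Lua code: the helper names `escape_label_value` and `full_metric_name`, the escape table, and the sample labels are all assumptions. The idea is to skip the substitution entirely when a label value has nothing to escape, and to fill a preallocated list by index instead of appending repeatedly.

```python
# Hypothetical helpers for illustration only; not the PR's implementation.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n"}


def escape_label_value(value: str) -> str:
    # Fast path: most label values contain nothing to escape,
    # so the substitution work is skipped entirely.
    if "\\" not in value and '"' not in value and "\n" not in value:
        return value
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def full_metric_name(name, label_names, label_values):
    if not label_names:
        return name
    # Preallocate and fill by index rather than growing the list per label.
    parts = [None] * len(label_names)
    for i, (key, value) in enumerate(zip(label_names, label_values)):
        parts[i] = f'{key}="{escape_label_value(str(value))}"'
    return name + "{" + ",".join(parts) + "}"


example = full_metric_name(
    "upstream_target_health",          # illustrative metric name
    ["upstream", "state"],
    ["upstream-1", "healthy"],
)
```

In the common case no label value needs escaping, so the per-label cost reduces to a few membership checks and one string format.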

In my opinion, the iteration does not need to work this way: traversing all upstream health statuses on every scrape is clearly not a good design. The yield used here is only a workaround. The required CPU time does not disappear; the work is simply made preemptible so that it affects proxy latency less.
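
The trade-off described above can be sketched with cooperative scheduling. This is a Python `asyncio` analogy, not the PR's Lua code: `collect_health`, `proxy_request`, the batch size `YIELD_EVERY`, and the fake upstream names are assumptions made for the example. The collector still does all of its work, but it hands control back every few iterations so that another task can run in between.

```python
import asyncio

YIELD_EVERY = 1000  # hypothetical batch size between yields


async def collect_health(upstreams, events):
    statuses = []
    for i, upstream in enumerate(upstreams, start=1):
        statuses.append((upstream, "healthy"))  # stand-in for the real per-upstream work
        if i % YIELD_EVERY == 0:
            events.append("yield")
            await asyncio.sleep(0)  # hand control back to the event loop
    return statuses


async def proxy_request(events):
    # Stand-in for latency-sensitive proxy traffic.
    events.append("proxy")


async def main():
    events = []
    upstreams = [f"upstream-{n}" for n in range(3000)]
    collector = asyncio.create_task(collect_health(upstreams, events))
    proxy = asyncio.create_task(proxy_request(events))
    statuses = await collector
    await proxy
    return statuses, events


statuses, events = asyncio.run(main())
```

The proxy task is served after the collector's first yield rather than after the whole loop finishes. The total amount of work is unchanged, which is why the yield is a workaround and not a fix for the cost of the iteration itself.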

Checklist

Issue reference

KAG-632

@oowl oowl force-pushed the fix/prometheus-exporter branch 2 times, most recently from 701c1e3 to 91d3091 Compare May 26, 2023 08:33
@oowl oowl marked this pull request as draft May 26, 2023 08:34
@oowl oowl changed the title fix(prometheus): reduce upstream helath expensive iterate interval fix(prometheus): reduce upstream helath expensive iterate collect in every scrape May 26, 2023
@oowl oowl force-pushed the fix/prometheus-exporter branch from 91d3091 to 7b0a6e3 Compare May 29, 2023 07:19
@bungle bungle changed the title fix(prometheus): reduce upstream helath expensive iterate collect in every scrape fix(prometheus): reduce upstream health expensive iterate collect in every scrape May 30, 2023
@oowl oowl force-pushed the fix/prometheus-exporter branch 3 times, most recently from d062459 to 891d263 Compare June 5, 2023 02:18
@StarlightIbuki (Contributor)

Please rebase the PR, as master has been reset.

@oowl oowl force-pushed the fix/prometheus-exporter branch from 891d263 to 2eff7e9 Compare June 5, 2023 06:27
kong/plugins/prometheus/exporter.lua (outdated, resolved)
kong/plugins/prometheus/prometheus.lua (outdated, resolved)
@oowl (Member, Author) commented Jun 5, 2023

I have tested it using fake upstreams and routes:

upstreams 10000
services 10000
routes 10000

metrics 100383


1. Kong 3.2.2

[root@ip-172-31-17-2 ~]# curl 127.0.0.1:8001/metrics | wc -l
100383

Without scraping Prometheus metrics:
root@ip-172-31-23-166:~# wrk -d 100 -t 2 -c 2 -T 100s  http://172.31.17.40:8000 --latency
Running 2m test @ http://172.31.17.40:8000
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   277.27us  315.94us  15.94ms   98.99%
    Req/Sec     3.77k   528.44     4.71k    78.07%
  Latency Distribution
     50%  246.00us
     75%  283.00us
     90%  344.00us
     99%  595.00us
  751336 requests in 1.67m, 1.00GB read

Scraping Prometheus metrics every 1s:
root@ip-172-31-23-166:~# wrk -d 100 -t 2 -c 2 -T 100s  http://172.31.17.40:8000 --latency
Running 2m test @ http://172.31.17.40:8000
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    35.10ms  120.22ms   1.03s    88.94%
    Req/Sec     3.60k     0.87k    4.65k    87.51%
  Latency Distribution
     50%  253.00us
     75%  314.00us
     90%  176.93ms
     99%  619.46ms
  647078 requests in 1.67m, 0.86GB read


2. Prometheus plugin patched with this PR

[root@ip-172-31-17-2 ~]# curl  127.0.0.1:8001/metrics | wc -l
100383

Without scraping Prometheus metrics:
root@ip-172-31-23-166:~# wrk -d 100 -t 2 -c 2 -T 100s  http://172.31.17.40:8000 --latency
Running 2m test @ http://172.31.17.40:8000
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   232.68us  445.25us  22.15ms   99.33%
    Req/Sec     4.69k   665.18     5.99k    81.81%
  Latency Distribution
     50%  197.00us
     75%  220.00us
     90%  273.00us
     99%  559.00us
  933358 requests in 1.67m, 840.27MB read
Requests/sec:   9324.26
Transfer/sec:      8.39MB

Scraping Prometheus metrics every 1s:
root@ip-172-31-23-166:~# wrk -d 100 -t 2 -c 2 -T 100s  http://172.31.17.40:8000 --latency
Running 2m test @ http://172.31.17.40:8000
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    16.02ms   65.95ms 680.76ms   94.15%
    Req/Sec     4.29k     1.35k    5.96k    86.03%
  Latency Distribution
     50%  202.00us
     75%  245.00us
     90%    8.91ms
     99%  391.52ms
  808053 requests in 1.67m, 727.46MB read
Requests/sec:   8079.42
Transfer/sec:      7.27MB

With metrics scraped every 1s, this PR brings a measurable latency improvement over Kong 3.2.2:

p90:
176.93ms -> 8.91ms

p99:
619.46ms -> 391.52ms
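
As a quick check of the comparison, the relative reductions can be computed from the figures in the two "scrape every 1s" wrk runs above. The snippet below is only arithmetic on those reported numbers (the p99 baseline is the 619.46ms shown in the Kong 3.2.2 wrk output); the dictionary and function names are made up for the example.

```python
# Latency figures in milliseconds, copied from the wrk runs with scraping every 1s.
before = {"avg": 35.10, "p90": 176.93, "p99": 619.46}  # Kong 3.2.2
after = {"avg": 16.02, "p90": 8.91, "p99": 391.52}     # plugin patched with this PR


def reduction_percent(old_ms: float, new_ms: float) -> float:
    """Relative latency reduction, as a percentage of the old value."""
    return (old_ms - new_ms) / old_ms * 100.0


summary = {key: round(reduction_percent(before[key], after[key]), 1) for key in before}

for key, pct in summary.items():
    print(f"{key}: {before[key]}ms -> {after[key]}ms ({pct}% lower)")
```

By this measure the p90 drops by roughly 95% and the p99 by roughly 37%, so the tail latency improves but is still far above the no-scrape baseline.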

@oowl oowl changed the title fix(prometheus): reduce upstream health expensive iterate collect in every scrape perf(prometheus): fix upstream health expensive iterate latency spike issue Jun 5, 2023
@oowl oowl marked this pull request as ready for review June 5, 2023 14:07
@oowl oowl force-pushed the fix/prometheus-exporter branch from 48f8b9a to be8c6ba Compare June 5, 2023 14:07
kong/plugins/prometheus/prometheus.lua (outdated, resolved)
kong/plugins/prometheus/prometheus.lua (outdated, resolved)
kong/plugins/prometheus/prometheus.lua (outdated, resolved)
@oowl oowl force-pushed the fix/prometheus-exporter branch 2 times, most recently from 318eb27 to 20a7f5c Compare June 6, 2023 06:39
@oowl oowl requested a review from chronolaw June 6, 2023 06:39
@oowl oowl requested a review from dndx June 7, 2023 03:02
@oowl oowl force-pushed the fix/prometheus-exporter branch 2 times, most recently from 80afa81 to 841886c Compare June 7, 2023 10:42
@oowl oowl force-pushed the fix/prometheus-exporter branch from 841886c to b1be008 Compare June 8, 2023 08:36
@dndx dndx merged commit 18216ac into master Jun 8, 2023
21 checks passed
@dndx dndx deleted the fix/prometheus-exporter branch June 8, 2023 09:04
4 participants