TT-5741 PrometheusPump write optimizations #452
Conversation
lgtm! @tbuchaillot check tests that are failing
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
@tbuchaillot should we merge it into release-1.6?
/release to release-1.6
Working on it! Note that it can take a few minutes.
* prometheus write optms
* refactoring prometheus code + adding tests
* linting comment
* removing unused var
* linting test errors
* solving base metrics init + adding more code comment
* fixing TestPromtheusCreateBasicMetrics test

(cherry picked from commit 2622d9b)
@tbuchaillot Successfully merged
Description
At the moment, our `PrometheusPump` has multiple `CounterVectors` (the way Prometheus exposes its metrics). When we get a batch of analytics records, we increment the `CounterVectors` record by record using the `Inc()` function from the Prometheus library.

We have discovered that the Prometheus library uses mutexes to increment the counter in a thread-safe way, so we are effectively locking and unlocking a mutex for each individual record. That locking and unlocking is CPU intensive, so in high-throughput scenarios we incur ever-increasing CPU overhead.
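For illustration, here is a rough sketch of the pre-optimization write path (the type and metric names below are simplified stand-ins, not the pump's exact code): every record in the batch results in one call into the Prometheus client, and therefore one pass through its internal locking.

```go
package prompump

import "github.com/prometheus/client_golang/prometheus"

// analyticsRecord is a simplified stand-in for the pump's analytics record type.
type analyticsRecord struct {
	Code  string
	APIID string
}

// Hypothetical counter vector, similar in shape to the ones the pump registers.
var totalStatusMetrics = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "tyk_http_status",
		Help: "HTTP status codes per API",
	},
	[]string{"code", "api"},
)

// writePerRecord is the pre-optimization pattern: one Inc() per analytics
// record, so a batch of N records pays the library's synchronization cost N times.
func writePerRecord(records []analyticsRecord) {
	for _, rec := range records {
		totalStatusMetrics.WithLabelValues(rec.Code, rec.APIID).Inc()
	}
}
```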
The proposed solution is to treat the analytics records in batches rather than individually. We use a map to aggregate the metrics in memory (per analytics record batch) and write them with the `Add()` function, rather than individually with `Inc()`. This lowers the number of locks/unlocks and decreases CPU usage in high-throughput environments.
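A minimal sketch of the batching idea, reusing the hypothetical `totalStatusMetrics` and `analyticsRecord` from the previous snippet: counts are accumulated in a plain map while iterating the batch, and `Add()` is called once per distinct label combination instead of once per record.

```go
// writeBatched aggregates the batch in memory first and flushes with Add(),
// so the counter vector is only touched once per distinct label set.
func writeBatched(records []analyticsRecord) {
	counts := make(map[analyticsRecord]float64)
	for _, rec := range records {
		counts[rec]++
	}
	for rec, n := range counts {
		totalStatusMetrics.WithLabelValues(rec.Code, rec.APIID).Add(n)
	}
}
```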
In order to do that, this PR starts using our custom `PrometheusMetrics` struct (the one we already used for custom Prometheus metrics) for all the Prometheus metrics we write. We init all our default metrics in the `CreateBasicMetrics` method and hold them in the `allMetrics` field of the `PrometheusPump` struct.

On data purge - when the `WriteData` method is called - we loop over all our current metrics (default + custom) and, for each analytics record, call our custom `Inc()` or `Observe()` function depending on the metric type.

After we have looped over all the analytics records, we call the `Expose()` function on each metric. This function exposes the metric values to the Prometheus client, using `Add()` to increment the counter or `Observe()` to register the request time. After exposing the metrics to Prometheus, we reset the in-memory maps. To store the label values in the map, we concatenate them with a "--" separator and use the result as the key.
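Roughly, the counter side of this flow looks like the sketch below (field and method names are modeled on the description above, not copied from the actual `PrometheusMetrics` implementation): `Inc()` only touches an in-memory map keyed by the "--"-joined label values, and `Expose()` replays the aggregated counts into the Prometheus client before resetting the map.

```go
package prompump

import (
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

// counterMetric is an illustrative stand-in for the counter case of the
// custom metrics struct described above.
type counterMetric struct {
	counterVec *prometheus.CounterVec
	counterMap map[string]uint64 // "--"-joined label values -> count
}

// Inc aggregates one analytics record in memory; no Prometheus lock is taken here.
func (m *counterMetric) Inc(labelValues ...string) {
	if m.counterMap == nil {
		m.counterMap = make(map[string]uint64)
	}
	m.counterMap[strings.Join(labelValues, "--")]++
}

// Expose flushes the aggregated values to the Prometheus client with Add(),
// then resets the in-memory map for the next batch.
func (m *counterMetric) Expose() {
	for key, count := range m.counterMap {
		labels := strings.Split(key, "--")
		m.counterVec.WithLabelValues(labels...).Add(float64(count))
	}
	m.counterMap = make(map[string]uint64)
}
```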
For histogram metrics, we added an extra configuration option, `aggregate_observations`, to enable or disable this behavior. Histogram metrics are more complex to calculate than a counter, and enabling this batching-like behavior results in some precision loss from a metric perspective: before, we registered the request time of every request in a batch; now, we register the average request time of the batch.

This solution should be 100% backward compatible by default. If you set `aggregate_observations` to true, you will have less precision in your `tyk_latency` metric.
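To make the trade-off concrete, here is a hedged sketch of the histogram side (same illustrative package and imports as the previous snippet; the flag and field names are assumptions, not the pump's exact code): with aggregation disabled each request time is observed immediately, and with `aggregate_observations` enabled only one averaged observation per label set is written at `Expose()` time.

```go
// histogramMetric is an illustrative stand-in for the histogram case.
type histogramMetric struct {
	histogramVec          *prometheus.HistogramVec
	aggregateObservations bool               // mirrors the aggregate_observations config flag
	sums                  map[string]float64 // "--"-joined label values -> total request time
	counts                map[string]uint64  // "--"-joined label values -> number of requests
}

// Observe either records the request time immediately (default, full precision)
// or accumulates it for a single averaged observation at Expose() time.
func (m *histogramMetric) Observe(requestTime float64, labelValues ...string) {
	if !m.aggregateObservations {
		m.histogramVec.WithLabelValues(labelValues...).Observe(requestTime)
		return
	}
	if m.sums == nil {
		m.sums = make(map[string]float64)
		m.counts = make(map[string]uint64)
	}
	key := strings.Join(labelValues, "--")
	m.sums[key] += requestTime
	m.counts[key]++
}

// Expose writes one averaged observation per label set and resets the maps.
func (m *histogramMetric) Expose() {
	for key, sum := range m.sums {
		labels := strings.Split(key, "--")
		m.histogramVec.WithLabelValues(labels...).Observe(sum / float64(m.counts[key]))
	}
	m.sums = make(map[string]float64)
	m.counts = make(map[string]uint64)
}
```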
Related Issue
https://tyktech.atlassian.net/browse/TT-5741
Motivation and Context
https://tyktech.atlassian.net/browse/TT-5741
How This Has Been Tested
Added a LOT of unit tests.
Screenshots (if appropriate)
Types of changes
Checklist
* If pulling from a fork, don't request your `master`!
* Make the pull request against our `master` branch (left side). Also, you should start your branch off our latest `master`.
* `go mod tidy && go mod vendor`
* `go fmt -s`
* `go vet`