TT-5741 PrometheusPump write optimizations #452
Conversation
lgtm! @tbuchaillot check tests that are failing
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
@tbuchaillot should we merge it into release-1.6?
/release to release-1.6
Working on it! Note that it can take a few minutes.
* prometheus write optms
* refactoring prometheus code + adding tests
* linting comment
* removing unused var
* linting test errors
* solving base metrics init + adding more code comment
* fixing TestPromtheusCreateBasicMetrics test

(cherry picked from commit 2622d9b)
@tbuchaillot Successfully merged
Description
At the moment, our `PrometheusPump` has multiple `CounterVectors` (the way Prometheus exposes its metrics). When we get a batch of analytics records, we increment the `CounterVectors` record by record using the `Inc()` function from the Prometheus library.

We have discovered that the Prometheus library uses mutexes to increment the counter in a thread-safe way, so we are effectively locking and unlocking a mutex for each individual record. That locking and unlocking is CPU intensive, so in high-throughput scenarios we incur ever-increasing CPU overhead.
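For illustration, here is a rough sketch of the pre-optimization write path (the type and metric names below are simplified stand-ins, not the pump's exact code): every record in the batch results in one call into the Prometheus client, and therefore one pass through its internal locking.

```go
package prompump

import "github.com/prometheus/client_golang/prometheus"

// analyticsRecord is a simplified stand-in for the pump's analytics record type.
type analyticsRecord struct {
	Code  string
	APIID string
}

// Hypothetical counter vector, similar in shape to the ones the pump registers.
var totalStatusMetrics = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "tyk_http_status",
		Help: "HTTP status codes per API",
	},
	[]string{"code", "api"},
)

// writePerRecord is the pre-optimization pattern: one Inc() per analytics
// record, so a batch of N records pays the library's synchronization cost N times.
func writePerRecord(records []analyticsRecord) {
	for _, rec := range records {
		totalStatusMetrics.WithLabelValues(rec.Code, rec.APIID).Inc()
	}
}
```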
The proposed solution is to treat the analytics records in batches rather than individually. We use a map to aggregate the metrics in memory (per analytics record batch) and write them with the `Add()` function, rather than individually with `Inc()`. This lowers the number of locks/unlocks and decreases CPU usage in high-throughput environments.
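A minimal sketch of the batching idea, reusing the hypothetical `totalStatusMetrics` and `analyticsRecord` from the previous snippet: counts are accumulated in a plain map while iterating the batch, and `Add()` is called once per distinct label combination instead of once per record.

```go
// writeBatched aggregates the batch in memory first and flushes with Add(),
// so the counter vector is only touched once per distinct label set.
func writeBatched(records []analyticsRecord) {
	counts := make(map[analyticsRecord]float64)
	for _, rec := range records {
		counts[rec]++
	}
	for rec, n := range counts {
		totalStatusMetrics.WithLabelValues(rec.Code, rec.APIID).Add(n)
	}
}
```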
In order to do that, this PR starts using our custom `PrometheusMetrics` struct (the one we already used for custom Prometheus metrics) for all the Prometheus metrics we write. We init all our default metrics in the `CreateBasicMetrics` method and hold them in the `allMetrics` field of the `PrometheusPump` struct.

On data purge - when the `WriteData` method is called - we loop over all our current metrics (default + custom) and, for each analytics record, call our custom `Inc()` or `Observe()` function depending on the metric type.

After we have looped over all the analytics records, we call the `Expose()` function on each metric. This function exposes the metric values to the Prometheus client, using `Add()` to increment the counter or `Observe()` to register the request time. After exposing the metrics to Prometheus, we reset the in-memory maps. To store the label values in the map, we concatenate them with a "--" separator and use the result as the key.
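Roughly, the counter side of this flow looks like the sketch below (field and method names are modeled on the description above, not copied from the actual `PrometheusMetrics` implementation): `Inc()` only touches an in-memory map keyed by the "--"-joined label values, and `Expose()` replays the aggregated counts into the Prometheus client before resetting the map.

```go
package prompump

import (
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

// counterMetric is an illustrative stand-in for the counter case of the
// custom metrics struct described above.
type counterMetric struct {
	counterVec *prometheus.CounterVec
	counterMap map[string]uint64 // "--"-joined label values -> count
}

// Inc aggregates one analytics record in memory; no Prometheus lock is taken here.
func (m *counterMetric) Inc(labelValues ...string) {
	if m.counterMap == nil {
		m.counterMap = make(map[string]uint64)
	}
	m.counterMap[strings.Join(labelValues, "--")]++
}

// Expose flushes the aggregated values to the Prometheus client with Add(),
// then resets the in-memory map for the next batch.
func (m *counterMetric) Expose() {
	for key, count := range m.counterMap {
		labels := strings.Split(key, "--")
		m.counterVec.WithLabelValues(labels...).Add(float64(count))
	}
	m.counterMap = make(map[string]uint64)
}
```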
For histogram metrics, we added an extra configuration option, `aggregate_observations`, to enable or disable this behavior. Histogram metrics are more complex to calculate than a counter, and enabling this batching-like behavior results in some precision loss from a metric perspective: before, we registered the request time of every request in a batch; now, we register the average request time of the batch.

This solution should be 100% backward compatible by default. If you set `aggregate_observations` to true, you will have less precision in your `tyk_latency` metric.
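To make the trade-off concrete, here is a hedged sketch of the histogram side (same illustrative package and imports as the previous snippet; the flag and field names are assumptions, not the pump's exact code): with aggregation disabled each request time is observed immediately, and with `aggregate_observations` enabled only one averaged observation per label set is written at `Expose()` time.

```go
// histogramMetric is an illustrative stand-in for the histogram case.
type histogramMetric struct {
	histogramVec          *prometheus.HistogramVec
	aggregateObservations bool               // mirrors the aggregate_observations config flag
	sums                  map[string]float64 // "--"-joined label values -> total request time
	counts                map[string]uint64  // "--"-joined label values -> number of requests
}

// Observe either records the request time immediately (default, full precision)
// or accumulates it for a single averaged observation at Expose() time.
func (m *histogramMetric) Observe(requestTime float64, labelValues ...string) {
	if !m.aggregateObservations {
		m.histogramVec.WithLabelValues(labelValues...).Observe(requestTime)
		return
	}
	if m.sums == nil {
		m.sums = make(map[string]float64)
		m.counts = make(map[string]uint64)
	}
	key := strings.Join(labelValues, "--")
	m.sums[key] += requestTime
	m.counts[key]++
}

// Expose writes one averaged observation per label set and resets the maps.
func (m *histogramMetric) Expose() {
	for key, sum := range m.sums {
		labels := strings.Split(key, "--")
		m.histogramVec.WithLabelValues(labels...).Observe(sum / float64(m.counts[key]))
	}
	m.sums = make(map[string]float64)
	m.counts = make(map[string]uint64)
}
```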
Related Issue
https://tyktech.atlassian.net/browse/TT-5741
Motivation and Context
https://tyktech.atlassian.net/browse/TT-5741
How This Has Been Tested
Added a LOT of unit tests.
Screenshots (if appropriate)
Types of changes
Checklist
* If pulling from a fork, don't request your `master`!
* Make the pull request against our `master` branch (left side). Also, you should start your branch off our latest `master`.
* `go mod tidy && go mod vendor`
* `go fmt -s`
* `go vet`