
TT-5741 PrometheusPump write optimizations #452

Merged
merged 8 commits into master from refactor/prometheus-write on Jul 6, 2022

Conversation

tbuchaillot
Contributor

@tbuchaillot tbuchaillot commented Jun 15, 2022

Description

At the moment, our PrometheusPump has multiple CounterVectors (the way Prometheus exposes its metrics). When we get a batch of analytics records, we increment the CounterVectors record by record using the Inc() function from the Prometheus library.
We have discovered that the Prometheus library uses mutexes to increment the counters in a thread-safe way, so we are effectively locking and unlocking a mutex for every individual record.
This locking and unlocking of mutexes is CPU intensive, so in high-throughput scenarios we incur ever-increasing CPU overhead.

The proposed solution is to treat the analytics records in batches rather than individually. We use a map to aggregate the metrics in memory (per batch of analytics records) and write them with Add() instead of incrementing them one by one with Inc().
In this way, we lower the number of lock/unlock operations and decrease CPU usage in high-throughput environments.
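As a rough illustration of the change (not the actual pump code; the metric name, labels, and record type below are stand-ins), the per-record path takes one Prometheus mutex per record, while the batched path aggregates in a plain map and only locks once per unique label set:

```go
package example

import "github.com/prometheus/client_golang/prometheus"

// Stand-in CounterVec and analytics record type, for illustration only.
var statusCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{Name: "tyk_http_status", Help: "HTTP status codes per API"},
	[]string{"code", "api"},
)

type analyticsRecord struct{ Code, APIID string }

// Before: Inc() locks and unlocks a mutex inside the Prometheus client for
// every single analytics record.
func writePerRecord(records []analyticsRecord) {
	for _, r := range records {
		statusCounter.WithLabelValues(r.Code, r.APIID).Inc()
	}
}

// After: aggregate the batch in memory first, then call Add() once per unique
// label combination, so the number of lock/unlock cycles no longer grows with
// the number of records in the batch.
func writeBatched(records []analyticsRecord) {
	counts := map[[2]string]float64{}
	for _, r := range records {
		counts[[2]string{r.Code, r.APIID}]++
	}
	for labels, n := range counts {
		statusCounter.WithLabelValues(labels[0], labels[1]).Add(n)
	}
}
```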

To do that, this PR starts using our custom PrometheusMetrics struct (the same struct we already used for user-defined custom metrics) for all the Prometheus metrics we write. We initialize all our default metrics in the CreateBasicMetrics method and hold them in the allMetrics field of the PrometheusPump struct.
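A sketch of the shape this implies (field and type names here are illustrative, not necessarily the exact ones in the pump's source):

```go
package example

import "github.com/prometheus/client_golang/prometheus"

// histogramCounter accumulates request times for one label set within a batch.
type histogramCounter struct {
	totalRequestTime float64
	hits             float64
}

// PrometheusMetric holds one metric plus its in-memory aggregation state,
// which is flushed and reset when Expose() is called on data purge.
type PrometheusMetric struct {
	Name       string
	MetricType string // "counter" or "histogram"
	Labels     []string

	counterVec   *prometheus.CounterVec
	histogramVec *prometheus.HistogramVec

	counterMap   map[string]uint64
	histogramMap map[string]histogramCounter
}

// PrometheusPump keeps every metric (default + custom) in one slice so that
// WriteData can range over all of them on each purge.
type PrometheusPump struct {
	allMetrics []*PrometheusMetric
}

// createBasicMetrics mirrors the idea of CreateBasicMetrics: build the default
// metrics once at init time and store them on the pump.
func (p *PrometheusPump) createBasicMetrics() {
	p.allMetrics = append(p.allMetrics, &PrometheusMetric{
		Name:       "tyk_http_status",
		MetricType: "counter",
		Labels:     []string{"code", "api"},
		counterMap: map[string]uint64{},
	})
}
```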

On data purge, when the WriteData method is called, we loop over all our current metrics (default + custom) and, for each analytics record, call our custom Inc() or Observe() function depending on the metric type.
After looping over all the analytics records, we call the Expose() function on each metric. This function exposes the aggregated values to the Prometheus client, using Add() to increment the counters and Observe() to register the request times. After exposing the metrics to Prometheus, we reset the in-memory maps.

To store the label values in the map, we concatenate them with a "--" separator and use the result as the key.
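A trimmed sketch of that flow for a counter metric (method and helper names are illustrative): the custom Inc() only touches the in-memory map, and Expose() splits the "--" key back into label values, performs a single Add() per label set, and resets the map.

```go
package example

import (
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

type counterMetric struct {
	counterVec *prometheus.CounterVec
	counterMap map[string]uint64
}

// labelKey joins the label values with "--" so a label set can be used as a
// plain map key while aggregating a batch.
func labelKey(labelValues []string) string {
	return strings.Join(labelValues, "--")
}

// Inc increments the in-memory counter only; no Prometheus mutex is taken here.
func (m *counterMetric) Inc(labelValues ...string) {
	if m.counterMap == nil {
		m.counterMap = map[string]uint64{}
	}
	m.counterMap[labelKey(labelValues)]++
}

// Expose flushes the aggregated counts to the Prometheus client with one Add()
// per unique label set, then resets the map ready for the next batch.
func (m *counterMetric) Expose() {
	for key, count := range m.counterMap {
		labels := strings.Split(key, "--")
		m.counterVec.WithLabelValues(labels...).Add(float64(count))
	}
	m.counterMap = map[string]uint64{}
}
```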

For histogram metrics, we added an extra configuration option, aggregate_observations, to enable or disable this behavior. Histogram metrics are more complex to calculate than counters, and enabling this batching-like behavior results in some precision loss from a metric perspective: previously we registered the request time of every request in a batch, whereas with aggregation enabled we register the average request time of the batch.

This solution should be 100% backward compatible by default. If you set aggregate_observations to true, you will have less precision in your tyk_latency metric.
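A sketch of what the aggregate_observations trade-off looks like for the latency histogram (the option name comes from this PR; the Go names below are illustrative). With aggregation off, every request time is observed individually as before; with it on, the pump accumulates a sum and a count per label set and observes a single average per batch:

```go
package example

import (
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

// histogramSample accumulates request times for one label set within a batch.
type histogramSample struct {
	totalRequestTime float64
	hits             float64
}

type latencyMetric struct {
	histogramVec          *prometheus.HistogramVec
	histogramMap          map[string]histogramSample
	aggregateObservations bool // mirrors the aggregate_observations option
}

// Observe either records the request time immediately (default, fully precise)
// or accumulates it so Expose() can observe one averaged value per batch.
func (m *latencyMetric) Observe(requestTime float64, labelValues ...string) {
	if !m.aggregateObservations {
		m.histogramVec.WithLabelValues(labelValues...).Observe(requestTime)
		return
	}
	if m.histogramMap == nil {
		m.histogramMap = map[string]histogramSample{}
	}
	key := strings.Join(labelValues, "--")
	s := m.histogramMap[key]
	s.totalRequestTime += requestTime
	s.hits++
	m.histogramMap[key] = s
}

// Expose writes one averaged observation per label set and resets the map.
func (m *latencyMetric) Expose() {
	for key, s := range m.histogramMap {
		labels := strings.Split(key, "--")
		m.histogramVec.WithLabelValues(labels...).Observe(s.totalRequestTime / s.hits)
	}
	m.histogramMap = map[string]histogramSample{}
}
```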

Related Issue

https://tyktech.atlassian.net/browse/TT-5741

Motivation and Context

https://tyktech.atlassian.net/browse/TT-5741

How This Has Been Tested

Added a LOT of unit tests.

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • Make sure you are requesting to pull a topic/feature/bugfix branch (right side). If pulling from your own
    fork, don't request your master!
  • Make sure you are making a pull request against the master branch (left side). Also, you should start
    your branch off our latest master.
  • My change requires a change to the documentation.
    • If you've changed APIs, describe what needs to be updated in the documentation.
  • I have updated the documentation accordingly.
  • Modules and vendor dependencies have been updated; run go mod tidy && go mod vendor
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • Check your code additions will not fail linting checks:
    • gofmt -s
    • go vet

@tbuchaillot tbuchaillot marked this pull request as ready for review June 23, 2022 18:26
@tbuchaillot tbuchaillot changed the title prometheus write optimizations TT-5741 PrometheusPump write optimizations Jun 23, 2022
@tbuchaillot tbuchaillot requested a review from sredxny June 28, 2022 17:02
Contributor

@sredxny sredxny left a comment


LGTM! @tbuchaillot check the tests that are failing

@sonarcloud

sonarcloud bot commented Jul 6, 2022

Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs (rating A)
  • 0 Vulnerabilities (rating A)
  • 0 Security Hotspots (rating A)
  • 1 Code Smell (rating A)
  • No coverage information
  • 0.0% duplication

@sredxny sredxny merged commit 2622d9b into master Jul 6, 2022
@sredxny sredxny deleted the refactor/prometheus-write branch July 6, 2022 22:13
@sredxny
Contributor

sredxny commented Jul 6, 2022

@tbuchaillot should we merge it into release-1.6?

@tbuchaillot
Contributor Author

/release to release-1.6

@tykbot

tykbot bot commented Jul 7, 2022

Working on it! Note that it can take a few minutes.

tykbot bot pushed a commit that referenced this pull request Jul 7, 2022
* prometheus write optms

* refactoring prometheus code + adding tests

* linting comment

* removing unused var

* linting test errors

* solving base metrics init + adding more code comment

* fixing TestPromtheusCreateBasicMetrics test

(cherry picked from commit 2622d9b)
@tykbot

tykbot bot commented Jul 7, 2022

@tbuchaillot Successfully merged 2622d9b6865d32135f09d36b6fddfef6b3762263 to release-1.6 branch.
