
Add a Counter Aggregator #119

Merged: 4 commits merged into Stackdriver:master on May 9, 2019

Conversation

@knyar (Contributor) commented on Apr 15, 2019

Counter Aggregator allows exporting a sum of multiple Prometheus counters to Stackdriver as a single CUMULATIVE metric.

This can be useful if you have so many counter metrics (or metrics with high label cardinality) in Prometheus that importing all of them into Stackdriver directly would be too expensive, but you would still like a total cumulative counter that corresponds to the sum of those counters.

@StevenYCChou (Contributor) left a comment


This is not a thorough review; it is just meant to point out the concerns I have so far.

Review comments (resolved) on: cmd/stackdriver-prometheus-sidecar/main.go, retrieval/aggregator.go, retrieval/series_cache.go, README.md
@jkohen (Contributor) commented on Apr 22, 2019

@knyar I have a higher-level question. What is the motivation for doing this, when Prometheus itself can do the same aggregation and many more? We want to keep the sidecar nimble and predictable, and I'm concerned that this feature and the new metrics go beyond what is minimally required.

@knyar (Contributor, Author) commented on Apr 23, 2019

> @knyar I have a higher-level question. What is the motivation for doing this, when Prometheus itself can do the same aggregation and many more?

I don't think there is a way in Prometheus to export a cumulative metric that corresponds to the sum of multiple counters if those counters can be reset independently.

Recording rules can only be used to export a sum of per-second rates (e.g. `sum(rate(http_requests_total[2m]))`), which can only be imported to Stackdriver as a gauge. Creating a recording rule with a sum of counters (`sum(http_requests_total)`) will produce incorrect data when some of those counters get reset.
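
To make the reset problem concrete, here is a minimal, self-contained Go sketch (illustrative only, not the sidecar's actual code) of the reset-aware accumulation a counter aggregator needs to perform:

```go
package main

import "fmt"

// trackedCounter accumulates a monotonic total across counter resets.
// If a newly scraped value is lower than the previous one, the counter
// is assumed to have reset, and the whole new value counts as the delta.
type trackedCounter struct {
	previous float64 // last raw value seen
	total    float64 // reset-corrected cumulative total
}

func (c *trackedCounter) observe(value float64) {
	delta := value - c.previous
	if value < c.previous { // reset detected
		delta = value
	}
	c.previous = value
	c.total += delta
}

func main() {
	a, b := &trackedCounter{}, &trackedCounter{}
	a.observe(10)
	b.observe(20)
	fmt.Println(a.total + b.total) // 30, same as a naive sum so far
	b.observe(5)                   // b's process restarted: raw value drops 20 -> 5
	fmt.Println(a.total + b.total) // 35, still monotonic
}
```

At the reset, a naive `sum(http_requests_total)` would drop from 30 to 15, which a downstream consumer would misinterpret as a reset of the aggregate counter; the reset-corrected total stays monotonic.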

@jkohen (Contributor) left a comment


Anton, based on your explanation, I believe that this feature is useful, thanks for sending the change!

I see that you added a new mechanism for talking to the monitoring backend, and I'd like to better understand the reasons. The sidecar today uses the Stackdriver Monitoring API library and manages its own queues for batching, retries, etc. This PR introduces OpenCensus as a second active path with its own queuing and retry logic. It's not obvious how these two will interact. Can you think about what it would take to merge the aggregator into the existing export path instead?

I think it's reasonable to use OpenCensus for exporting the collected data to Stackdriver, but I would want to see it as a replacement, not as a second active path. I also expect us to have to perform more exhaustive testing before we can migrate the export path to OpenCensus, because the performance and resource usage are likely to change.

Review comments on: README.md, retrieval/aggregator.go (several threads)
@knyar (Contributor, Author) commented on Apr 29, 2019

Thanks for your comments, @jkohen.

> It's not obvious how these two will interact. Can you think about what it would take to merge the aggregator into the existing export path instead?

Sure. The main reason to use OpenCensus here is that it already implements everything one needs to export cumulative metrics to Stackdriver: it tracks start times, creates Stackdriver protos, and flushes data on a predefined interval. As you've mentioned, the sidecar is already instrumented to export OpenCensus metrics, and the two paths don't really overlap or interact with one another.
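
For illustration, a minimal sketch of this OpenCensus export path (the metric name and project ID are placeholders, and this is not the PR's actual code):

```go
package main

import (
	"context"
	"log"
	"time"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

func main() {
	// The exporter tracks start times, builds the Stackdriver protos, and
	// flushes on the reporting period, so none of that needs hand-rolling.
	exporter, err := stackdriver.NewExporter(stackdriver.Options{ProjectID: "my-project"})
	if err != nil {
		log.Fatal(err)
	}
	view.RegisterExporter(exporter)
	view.SetReportingPeriod(time.Minute)

	requests := stats.Int64("aggregated_http_requests", "Sum of http_requests_total", stats.UnitDimensionless)
	if err := view.Register(&view.View{
		Name:        "aggregated_http_requests",
		Measure:     requests,
		Aggregation: view.Sum(), // exported to Stackdriver as a CUMULATIVE metric
	}); err != nil {
		log.Fatal(err)
	}

	// Record reset-corrected deltas as they are extracted from the WAL.
	stats.Record(context.Background(), requests.M(42))
}
```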

After taking a quick look at the existing output path, I believe we would need to add the following functionality to the Counter Aggregator to make it use QueueManager rather than OpenCensus for reporting aggregated counters to Stackdriver (roughly sketched after the list):

  1. For each aggregated counter, we'll need to prepare a `monitoring_pb.TimeSeries` proto in a way similar to how `seriesCache` does it, and calculate its hash (later used to assign a `QueueManager` shard).
  2. Instead of relying on the reporting cycle implemented by OpenCensus, the counter aggregator will need to implement regular flushing of accumulated counter values to `QueueManager`. This will likely require a separate goroutine with a timer for each counter.
  3. The Counter Aggregator will need to keep track of the start time of each counter, fill values into the prepared `monitoring_pb.TimeSeries` protos, and push them to `QueueManager`.
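
A rough sketch of what steps 1-3 might look like (the `appender` interface here is hypothetical and stands in for `QueueManager`; the repo's actual `QueueManager` and `seriesCache` signatures may differ):

```go
package aggregator

import (
	"time"

	"github.com/golang/protobuf/ptypes/timestamp"
	metric_pb "google.golang.org/genproto/googleapis/api/metric"
	monitoring_pb "google.golang.org/genproto/googleapis/monitoring/v3"
)

// appender is the subset of QueueManager this sketch assumes.
type appender interface {
	Append(hash uint64, ts *monitoring_pb.TimeSeries) error
}

// flushLoop periodically pushes a cumulative point for one aggregated
// counter (step 2), filling the start time and current value (step 3)
// into a TimeSeries proto prepared up front, as seriesCache does (step 1).
func flushLoop(qm appender, hash uint64, base *monitoring_pb.TimeSeries,
	start time.Time, value func() float64, interval time.Duration) {
	for now := range time.Tick(interval) {
		ts := *base // shallow copy of the prepared proto
		ts.MetricKind = metric_pb.MetricDescriptor_CUMULATIVE
		ts.Points = []*monitoring_pb.Point{{
			Interval: &monitoring_pb.TimeInterval{
				StartTime: &timestamp.Timestamp{Seconds: start.Unix()},
				EndTime:   &timestamp.Timestamp{Seconds: now.Unix()},
			},
			Value: &monitoring_pb.TypedValue{
				Value: &monitoring_pb.TypedValue_DoubleValue{DoubleValue: value()},
			},
		}}
		qm.Append(hash, &ts)
	}
}
```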

Overall, I think `seriesCache` + `QueueManager` are well optimized for high-bandwidth proxying of samples from the Prometheus WAL to Stackdriver, but they are too low-level for the simple task of exporting a small number of counters, for which OpenCensus provides a simpler and more intuitive API. Integrating the counter aggregator with `QueueManager` seems likely to add complexity without significant benefits, but I'm happy to take a stab at it if you feel strongly about it.

@knyar (Contributor, Author) commented on May 1, 2019

@jkohen, @StevenYCChou, I've taken another look at this PR, and I believe I've resolved all comments that had a clear call to action.

This is currently blocked on your decision on whether it's OK for the Counter Aggregator to use OpenCensus (which would be my preference, since it's simpler), or if you'd like me to integrate it with QueueManager.

@jkohen (Contributor) commented on May 7, 2019

Sorry for the delay. @StevenYCChou and I discussed your response offline, and I'm convinced to accept the new path you added. However, I still want to see some validation of performance and resource usage, because we are pulling in a large dependency with OpenCensus.

We have a benchmark here: https://github.com/Stackdriver/stackdriver-prometheus-sidecar/tree/master/bench

You'd have to trigger the codepaths you added. Let us know if you encounter issues, because we haven't used it in a while.

@jkohen (Contributor) left a comment


The code generally looks good to me. I have a few minor comments.

Review comments on: README.md, retrieval/aggregator.go (several threads), retrieval/series_cache_test.go, retrieval/aggregator_test.go
@knyar (Contributor, Author) commented on May 8, 2019

> However, I still want to see some validation of performance and resource usage, because we are pulling in a large dependency with OpenCensus.
>
> We have a benchmark here: https://github.com/Stackdriver/stackdriver-prometheus-sidecar/tree/master/bench
>
> You'd have to trigger the codepaths you added. Let us know if you encounter issues, because we haven't used it in a while.

I've made a few changes necessary to run the benchmark with counter aggregator enabled and kept it running for 30 minutes.

I am not sure what data you'd like to see, but in this Drive folder (shared with you, but not public) you can find the output along with metric dumps for Prometheus server, old sidecar (before this PR) and the new sidecar.

Please let me know if you'd like me to add these changes to this PR as well.

@jkohen (Contributor) commented on May 8, 2019

Thanks for running the benchmark. It'd be interesting to look at resource usage for both sidecar versions, and maybe at other metrics like RPC rate, just for sanity. Can you share it? It should be in Stackdriver :)

@knyar (Contributor, Author) commented on May 8, 2019

> Thanks for running the benchmark. It'd be interesting to look at resource usage for both sidecar versions, and maybe at other metrics like RPC rate, just for sanity. Can you share it? It should be in Stackdriver :)

I believe the benchmark is executed against a fake Stackdriver client in bench/main.go, so no data is written to Stackdriver. However, I've shared metric dumps from both old and new sidecar versions in my previous comment.

CPU and RAM usage is comparable: the new sidecar used about 0.4% more CPU time and allocated about 0.1% fewer bytes over the run.

In `metrics/sidecar_old.2019-05-08T12_04_02`:

    go_memstats_alloc_bytes_total 2.90230536232e+11
    process_cpu_seconds_total 4282.33

In `metrics/sidecar.2019-05-08T12_04_01`:

    go_memstats_alloc_bytes_total 2.89940170256e+11
    process_cpu_seconds_total 4301.3

RPC stats are in the same files, but there's too much data for me to attach it here.

@knyar (Contributor, Author) commented on May 9, 2019

Please let me know if you have any other comments, or if you would like me to squash the commits for this PR to be merged.

I am also curious whether you'd like my changes to the benchmark to be included in this PR or not.

Thanks!

@jkohen (Contributor) commented on May 9, 2019

> Please let me know if you have any other comments, or if you would like me to squash the commits for this PR to be merged.

YCChou or I will squash before merging; don't worry.

> I am also curious whether you'd like my changes to the benchmark to be included in this PR or not.

Let's look at these separately.

> I believe the benchmark is executed against a fake Stackdriver client in bench/main.go, so no data is written to Stackdriver. […] CPU and RAM usage is comparable. […]

Looks good. I also spot checked another pair of files 15 minutes into the test and I see the same. Not as cool as a time series, but it does the job for now. Thank you!

@jkohen (Contributor) commented on May 9, 2019

@StevenYCChou, can you confirm that this is OK with you?

@StevenYCChou (Contributor):
Yes, LGTM.

@StevenYCChou StevenYCChou merged commit 16dfa35 into Stackdriver:master May 9, 2019