
Add a Counter Aggregator #119

Merged: 4 commits merged into Stackdriver:master on May 9, 2019

Conversation

@knyar (Contributor) commented on Apr 15, 2019

Counter Aggregator allows exporting a sum of multiple Prometheus counters to Stackdriver as a single CUMULATIVE metric.

This can be useful if you have so many counter metrics (or metrics with high label cardinality) in Prometheus that importing all of them into Stackdriver directly would be too expensive, but you would still like a total cumulative counter that corresponds to the sum of those counters.

@StevenYCChou (Contributor) left a comment


This is not a thorough review; it is just meant to point out the concerns I have so far.

Review comments (resolved) on: cmd/stackdriver-prometheus-sidecar/main.go, retrieval/aggregator.go, retrieval/series_cache.go, README.md
@jkohen (Contributor) commented on Apr 22, 2019

@knyar I have a higher-level question. What is the motivation for doing this, when Prometheus itself can do the same aggregation and many more? We want to keep the sidecar nimble and predictable, and I'm concerned that this feature and the new metrics go beyond what is minimally required.

@knyar (Contributor, Author) commented on Apr 23, 2019

> @knyar I have a higher-level question. What is the motivation for doing this, when Prometheus itself can do the same aggregation and many more?

I don't think there is a way in Prometheus to export a cumulative metric that corresponds to the sum of multiple counters if those counters can be reset independently.

Recording rules can only be used to export a sum of per-second rates (e.g. `sum(rate(http_requests_total[2m]))`), which can only be imported to Stackdriver as a gauge. Creating a recording rule with a sum of counters (`sum(http_requests_total)`) will produce incorrect data when some of those counters get reset.
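
To make the reset problem concrete, here is a minimal, self-contained Go sketch (illustrative only, not the sidecar's actual code) of the reset-aware accumulation a counter aggregator needs to perform:

```go
package main

import "fmt"

// trackedCounter accumulates a monotonic total across counter resets.
// If a newly scraped value is lower than the previous one, the counter
// is assumed to have reset, and the whole new value counts as the delta.
type trackedCounter struct {
	previous float64 // last raw value seen
	total    float64 // reset-corrected cumulative total
}

func (c *trackedCounter) observe(value float64) {
	delta := value - c.previous
	if value < c.previous { // reset detected
		delta = value
	}
	c.previous = value
	c.total += delta
}

func main() {
	a, b := &trackedCounter{}, &trackedCounter{}
	a.observe(10)
	b.observe(20)
	fmt.Println(a.total + b.total) // 30, same as a naive sum so far
	b.observe(5)                   // b's process restarted: raw value drops 20 -> 5
	fmt.Println(a.total + b.total) // 35, still monotonic
}
```

At the reset, a naive `sum(http_requests_total)` would drop from 30 to 15, which a downstream consumer would misinterpret as a reset of the aggregate counter; the reset-corrected total stays monotonic.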

@jkohen (Contributor) left a comment


Anton, based on your explanation, I believe that this feature is useful, thanks for sending the change!

I see that you added a new mechanism for talking to the monitoring backend, and I'd like to better understand the reasons. The sidecar today uses the Stackdriver Monitoring API library and manages its own queues for batching, retries, etc. This PR introduces OpenCensus as a second active path with its own queuing and retry logic. It's not obvious how these two will interact. Can you think about what it would take to merge the aggregator into the existing export path instead?

I think it's reasonable to use OpenCensus for exporting the collected data to Stackdriver, but I would want to see it as a replacement, not as a second active path. I also expect us to have to perform more exhaustive testing before we can migrate the export path to OpenCensus, because the performance and resource usage are likely to change.

Review comments on: README.md, retrieval/aggregator.go (several threads)
@knyar (Contributor, Author) commented on Apr 29, 2019

Thanks for your comments, @jkohen.

> It's not obvious how these two will interact. Can you think about what it would take to merge the aggregator into the existing export path instead?

Sure. The main reason to use OpenCensus here is that it already implements everything one needs to export cumulative metrics to Stackdriver: it tracks start times, creates Stackdriver protos, and flushes data on a predefined interval. As you've mentioned, the sidecar is already instrumented to export OpenCensus metrics, and the two paths don't really overlap or interact with one another.
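
For illustration, a minimal sketch of this OpenCensus export path (the metric name and project ID are placeholders, and this is not the PR's actual code):

```go
package main

import (
	"context"
	"log"
	"time"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

func main() {
	// The exporter tracks start times, builds the Stackdriver protos, and
	// flushes on the reporting period, so none of that needs hand-rolling.
	exporter, err := stackdriver.NewExporter(stackdriver.Options{ProjectID: "my-project"})
	if err != nil {
		log.Fatal(err)
	}
	view.RegisterExporter(exporter)
	view.SetReportingPeriod(time.Minute)

	requests := stats.Int64("aggregated_http_requests", "Sum of http_requests_total", stats.UnitDimensionless)
	if err := view.Register(&view.View{
		Name:        "aggregated_http_requests",
		Measure:     requests,
		Aggregation: view.Sum(), // exported to Stackdriver as a CUMULATIVE metric
	}); err != nil {
		log.Fatal(err)
	}

	// Record reset-corrected deltas as they are extracted from the WAL.
	stats.Record(context.Background(), requests.M(42))
}
```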

After taking a quick look at the existing output path, I believe we would need to add the following functionality to the Counter Aggregator to make it use QueueManager rather than OpenCensus for reporting aggregated counters to Stackdriver (roughly sketched after the list):

  1. For each aggregated counter, we'll need to prepare a `monitoring_pb.TimeSeries` proto in a way similar to how `seriesCache` does it, and calculate its hash (later used to assign a `QueueManager` shard).
  2. Instead of relying on the reporting cycle implemented by OpenCensus, the counter aggregator will need to implement regular flushing of accumulated counter values to `QueueManager`. This will likely require a separate goroutine with a timer for each counter.
  3. The Counter Aggregator will need to keep track of the start time of each counter, fill values into the prepared `monitoring_pb.TimeSeries` protos, and push them to `QueueManager`.
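
A rough sketch of what steps 1-3 might look like (the `appender` interface here is hypothetical and stands in for `QueueManager`; the repo's actual `QueueManager` and `seriesCache` signatures may differ):

```go
package aggregator

import (
	"time"

	"github.com/golang/protobuf/ptypes/timestamp"
	metric_pb "google.golang.org/genproto/googleapis/api/metric"
	monitoring_pb "google.golang.org/genproto/googleapis/monitoring/v3"
)

// appender is the subset of QueueManager this sketch assumes.
type appender interface {
	Append(hash uint64, ts *monitoring_pb.TimeSeries) error
}

// flushLoop periodically pushes a cumulative point for one aggregated
// counter (step 2), filling the start time and current value (step 3)
// into a TimeSeries proto prepared up front, as seriesCache does (step 1).
func flushLoop(qm appender, hash uint64, base *monitoring_pb.TimeSeries,
	start time.Time, value func() float64, interval time.Duration) {
	for now := range time.Tick(interval) {
		ts := *base // shallow copy of the prepared proto
		ts.MetricKind = metric_pb.MetricDescriptor_CUMULATIVE
		ts.Points = []*monitoring_pb.Point{{
			Interval: &monitoring_pb.TimeInterval{
				StartTime: &timestamp.Timestamp{Seconds: start.Unix()},
				EndTime:   &timestamp.Timestamp{Seconds: now.Unix()},
			},
			Value: &monitoring_pb.TypedValue{
				Value: &monitoring_pb.TypedValue_DoubleValue{DoubleValue: value()},
			},
		}}
		qm.Append(hash, &ts)
	}
}
```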

Overall, I think `seriesCache` + `QueueManager` are well optimized for high-bandwidth proxying of samples from the Prometheus WAL to Stackdriver, but they are too low-level for the simple task of exporting a small number of counters, for which OpenCensus provides a simpler and more intuitive API. Integrating the counter aggregator with `QueueManager` seems likely to add complexity without significant benefits, but I'm happy to take a stab at it if you feel strongly about it.

@knyar (Contributor, Author) commented on May 1, 2019

@jkohen, @StevenYCChou, I've taken another look at this PR, and I believe I've resolved all comments that had a clear call to action.

This is currently blocked on your decision on whether it's OK for the Counter Aggregator to use OpenCensus (which would be my preference, since it's simpler), or if you'd like me to integrate it with QueueManager.

@jkohen (Contributor) commented on May 7, 2019

Sorry for the delay. @StevenYCChou and I discussed your response offline, and I'm convinced to accept the new path you added. However, I still want to see some validation of performance and resource usage, because we are pulling in a large dependency with OpenCensus.

We have a benchmark here: https://github.com/Stackdriver/stackdriver-prometheus-sidecar/tree/master/bench

You'd have to trigger the codepaths you added. Let us know if you encounter issues, because we haven't used it in a while.

@jkohen (Contributor) left a comment


The code generally looks good to me. I have a few minor comments.

Review comments on: README.md, retrieval/aggregator.go (several threads), retrieval/series_cache_test.go, retrieval/aggregator_test.go
@knyar (Contributor, Author) commented on May 8, 2019

> However, I still want to see some validation of performance and resource usage, because we are pulling in a large dependency with OpenCensus.
>
> We have a benchmark here: https://github.com/Stackdriver/stackdriver-prometheus-sidecar/tree/master/bench
>
> You'd have to trigger the codepaths you added. Let us know if you encounter issues, because we haven't used it in a while.

I've made a few changes necessary to run the benchmark with counter aggregator enabled and kept it running for 30 minutes.

I am not sure what data you'd like to see, but in this Drive folder (shared with you, but not public) you can find the output along with metric dumps for Prometheus server, old sidecar (before this PR) and the new sidecar.

Please let me know if you'd like me to add these changes to this PR as well.

@jkohen (Contributor) commented on May 8, 2019

Thanks for running the benchmark. It'd be interesting to look at resource usage for both sidecar versions, and maybe at other metrics like RPC rate, just for sanity. Can you share it? It should be in Stackdriver :)

@knyar (Contributor, Author) commented on May 8, 2019

> Thanks for running the benchmark. It'd be interesting to look at resource usage for both sidecar versions, and maybe at other metrics like RPC rate, just for sanity. Can you share it? It should be in Stackdriver :)

I believe the benchmark is executed against a fake Stackdriver client in bench/main.go, so no data is written to Stackdriver. However, I've shared metric dumps from both old and new sidecar versions in my previous comment.

CPU and RAM usage is comparable: the new sidecar used about 0.4% more CPU time and allocated about 0.1% fewer bytes over the run.

In `metrics/sidecar_old.2019-05-08T12_04_02`:

    go_memstats_alloc_bytes_total 2.90230536232e+11
    process_cpu_seconds_total 4282.33

In `metrics/sidecar.2019-05-08T12_04_01`:

    go_memstats_alloc_bytes_total 2.89940170256e+11
    process_cpu_seconds_total 4301.3

RPC stats are in the same files, but there's too much data for me to attach it here.

@knyar (Contributor, Author) commented on May 9, 2019

Please let me know if you have any other comments, or if you would like me to squash the commits for this PR to be merged.

I am also curious whether you'd like my changes to the benchmark to be included in this PR or not.

Thanks!

@jkohen (Contributor) commented on May 9, 2019

> Please let me know if you have any other comments, or if you would like me to squash the commits for this PR to be merged.

YCChou or I will squash before merging; don't worry.

> I am also curious whether you'd like my changes to the benchmark to be included in this PR or not.

Let's look at these separately.

> I believe the benchmark is executed against a fake Stackdriver client in bench/main.go, so no data is written to Stackdriver. […] CPU and RAM usage is comparable. […]

Looks good. I also spot checked another pair of files 15 minutes into the test and I see the same. Not as cool as a time series, but it does the job for now. Thank you!

@jkohen (Contributor) commented on May 9, 2019

@StevenYCChou, can you confirm that this is OK with you?

@StevenYCChou (Contributor):
Yes, LGTM.

@StevenYCChou StevenYCChou merged commit 16dfa35 into Stackdriver:master May 9, 2019