
Support of cloudwatch StatisticSet #141

Closed
pm47 opened this issue Mar 18, 2013 · 4 comments

pm47 commented Mar 18, 2013

Hello,

I am currently looking at servo to monitor the response time of a REST service that is potentially called many times a second. I'd like to send data to CloudWatch, and obviously that requires some aggregation somewhere.

The CloudWatch documentation (see Aggregation) covers almost the same use case and recommends using a StatisticSet.

It makes sense because it allows you both to aggregate data early in the process and still not lose valuable information such as min, max, average, and the total number of measurement points.

I believe this is an extremely common use case, so can I ask how you handle this at Netflix? Do you create separate metrics for each of min/max/etc.?

Cheers

Pierre

@brharrington
Contributor

Yes, we treat them as separate metrics with a tag (dimension in Amazon's terminology) indicating the statistic. See BasicTimer for an example:

https://github.com/Netflix/servo/blob/master/servo-core/src/main/java/com/netflix/servo/monitor/BasicTimer.java

In Amazon's case there are four fixed statistics: min, max, sum, and count (along with avg, which is just a computed stat, sum / count), and that is essentially the same set used by BasicTimer. We could try to have the cloudwatch observer map this to a statistic set so it is a bit more natural in CloudWatch. The catch is that we don't necessarily enforce that set; for example, StatsTimer offers percentiles, stddev, etc.:

https://github.com/Netflix/servo/blob/master/servo-core/src/main/java/com/netflix/servo/monitor/StatsTimer.java

Since these cannot be a statistic in the CloudWatch sense, we would send them in as a dimension when going to CloudWatch. The other caveat with some of these stats is that they don't aggregate well after being sampled on a particular instance. We typically run clusters of many instances, and the stats in BasicTimer and Amazon CloudWatch can be accurately aggregated from the per-instance samples to get the value across a cluster. A percentile cannot be aggregated in this way, but it can still be useful for detecting outlier instances with less noise than just falling back to max. For our internal use cases we have our own monitoring backend that gets data from instances each minute, so we can look at, say, the 90th percentile across instances in the cluster to find outliers. We only send metrics to CloudWatch for use with auto-scaling policies, and these are typically aggregates across all instances in the auto scaling group; we don't send them with an InstanceId dimension since we can get that info internally.
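To sketch why these four statistics aggregate cleanly across instances while percentiles do not, here is a minimal illustration (the `StatSet` class is hypothetical, not part of servo or the AWS SDK): merging two per-instance summaries yields the exact cluster-wide values, whereas two per-instance percentiles cannot be combined into the cluster percentile.

```java
// Hypothetical per-instance summary, shaped like CloudWatch's StatisticSet.
public class StatSet {
    public final double min, max, sum;
    public final long count;

    public StatSet(double min, double max, double sum, long count) {
        this.min = min; this.max = max; this.sum = sum; this.count = count;
    }

    // Exact cluster-wide aggregate from two per-instance summaries:
    // min of mins, max of maxes, sum of sums, sum of counts.
    public static StatSet merge(StatSet a, StatSet b) {
        return new StatSet(
            Math.min(a.min, b.min),
            Math.max(a.max, b.max),
            a.sum + b.sum,
            a.count + b.count);
    }

    // avg is derived, never stored: sum / count.
    public double avg() {
        return sum / count;
    }
}
```

No such merge exists for a 90th percentile: two instances' p90 values alone don't determine the cluster p90, which is why those stats are only useful for comparing instances against each other.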


pm47 commented Mar 18, 2013

That answers my question, thanks!

pm47 closed this as completed Mar 18, 2013

pm47 commented Mar 28, 2013

On second thought, there is something I am still missing.

Let's take my use case (measuring response time of a particular function with many calls/minute). Let's suppose the function I want to monitor gets called 10 times a second, and let's say I send data to cloudwatch once every minute.

I need aggregation, not because I want to compute the min/max/... response time from application startup to now, but because I collect more data points than I can send to cloudwatch (which is precisely the use case of StatisticSet).

Therefore I don't think mapping a BasicTimer min, max, sum and count to a StatisticSet would make any sense in that respect anyway.

I saw that there is a ResettableMonitor interface, and I suppose I should maybe put an additional step between the BasicTimer and the CloudWatchMetricObserver that periodically resets the BasicTimer, but I'm not sure how to do it properly or whether it even makes sense.

Do you see my point?

Thanks

Pierre

pm47 reopened this Mar 28, 2013
@brharrington
Contributor

I think what you may be missing is:

https://github.com/Netflix/servo/wiki#converting-counter-values-to-rates

You would most likely want to wrap the counter with a CounterToRateMetricTransform so that what is reported is the rate per second rather than the cumulative total. You can see an example of how some of this would typically be configured in the example app:

https://github.com/Netflix/servo/wiki/Example-Application

I'll try to explain the reasoning to hopefully make it clearer why we do this. If you find it helpful, we'll try to fit it back into the main docs, as it might be useful to others in the future.

A basic counter for us is just a monotonically increasing integer. The rate transform caches the previous sample so that when we poll again we can compute a delta between the current sample and the previous sample to get the rate of change. We like this because it is cheap to update, usually just an increment of an AtomicLong, and we can have multiple pollers query it at different frequencies and still get an accurate rate for each poller because we would configure a separate rate transform for each. One downside to this though is that you don't get a value in the wrapped observer until the second poll because we need two samples to compute a delta. This usually isn't much of a problem for us because it is a long lived monitor and the first interval is typically early before load is going to an app.
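The delta logic described above can be sketched in a few lines. This is illustrative only (the class names are made up, not servo's actual API): the counter is a monotonically increasing AtomicLong, and each poller owns its own transform that caches the previous sample and turns deltas into a per-second rate.

```java
import java.util.concurrent.atomic.AtomicLong;

public class RateExample {
    // Cheap update path: the instrumented code just increments.
    static final AtomicLong requests = new AtomicLong();

    static void onRequest() {
        requests.incrementAndGet();
    }

    // One instance per poller, so pollers running at different
    // frequencies each see an accurate rate.
    static class RateTransform {
        private long prevValue = -1;
        private long prevTimeMs = -1;

        // Returns null on the first poll: two samples are needed to
        // compute a delta, which is the caveat mentioned above.
        Double poll(long value, long nowMs) {
            Double rate = null;
            if (prevValue >= 0) {
                rate = (value - prevValue) * 1000.0 / (nowMs - prevTimeMs);
            }
            prevValue = value;
            prevTimeMs = nowMs;
            return rate;
        }
    }
}
```

For example, a poller that sees the counter go from 0 to 600 over a 60-second window would report a rate of 10 per second.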

A resettable monitor is used to indicate to the pollers that the value only makes sense in a given window and needs to be reset to get an accurate view for that interval, for example what is the max value for an interval. The downside is that now we have to be aware of multiple pollers.

To give a concrete example, we might want to log to a local store on the instance every 10 seconds for more fine grained debugging info and log to cloudwatch every minute. If we used a resettable counter and the 10 second poller resets the count, the value reported to cloudwatch will be wrong because it would only include the count since the last time the 10 second poller reset the value. With a monotonic counter and independent rate transforms we can have accurate rates for both the 10 second local file poller and 1 minute cloudwatch poller.
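The pitfall in that example can be shown with a small sketch (an anti-pattern, not servo code): when a resettable counter is shared by a 10-second poller and a 1-minute poller, the fast poller's resets destroy the slow poller's window.

```java
// Anti-pattern: a shared resettable counter polled at two frequencies.
public class ResetPitfall {
    private long count = 0;

    void increment(long n) { count += n; }

    // Any poller calling this wipes the window for every other poller.
    long pollAndReset() {
        long v = count;
        count = 0;
        return v;
    }
}
```

If 100 events arrive in each 10-second window and the fast poller resets every window, the 1-minute poller sees nearly nothing instead of the true total of 600. With a monotonic counter and one rate transform per poller, both readings stay accurate.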

This doesn't work for all statistics, like max, so for those we make them resettable and pick a primary poller that will actually do the reset. So in the example above we would set cloudwatch to be the primary poller and the local file poller would just sample the max and not reset the value. That is the reason for the reset flag on the poll call to MetricPoller:

https://github.com/Netflix/servo/blob/master/servo-core/src/main/java/com/netflix/servo/publish/MetricPoller.java
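The primary-poller idea can be sketched as a resettable max gauge with a reset flag, mirroring the poll-with-reset shape described above (this is an illustration, not servo's implementation): the primary poller passes reset=true, while secondary pollers just sample.

```java
// Illustrative resettable max gauge; only the primary poller resets it.
public class MaxGauge {
    private long max = Long.MIN_VALUE;

    synchronized void record(long value) {
        if (value > max) max = value;
    }

    // reset=false: sample only (secondary pollers).
    // reset=true: sample and start a new window (primary poller).
    synchronized long poll(boolean reset) {
        long v = max;
        if (reset) max = Long.MIN_VALUE;
        return v;
    }
}
```

So the local-file poller would call `poll(false)` every 10 seconds, and the cloudwatch poller would call `poll(true)` each minute to get the max for its window.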

Bringing this back to BasicTimer: it is a composite with 4 monitors:

  • totalTime - monotonic counter of time recorded
  • count - monotonic counter of number of things recorded
  • max - resettable gauge keeping track of the max value since last reset
  • min - resettable gauge keeping track of the min value since last reset

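The shape of that composite can be sketched like this (a simplified illustration of the four monitors listed above, not the real BasicTimer class): the two counters only ever accumulate, while the gauges are reset per window by the primary poller.

```java
// Simplified sketch of BasicTimer's four monitors.
public class TimerSketch {
    long totalTime = 0;        // monotonic counter of time recorded
    long count = 0;            // monotonic counter of recordings
    long max = Long.MIN_VALUE; // resettable gauge
    long min = Long.MAX_VALUE; // resettable gauge

    void record(long duration) {
        totalTime += duration;
        count++;
        if (duration > max) max = duration;
        if (duration < min) min = duration;
    }

    // Called by the primary poller only; the counters are never reset.
    void resetWindow() {
        max = Long.MIN_VALUE;
        min = Long.MAX_VALUE;
    }
}
```

Each poller then derives the per-window average from deltas of totalTime and count, while min and max are read directly.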
One thing that is clear is that we probably need to provide a utility that simplifies setting up a background poller that typically does the "right thing" without users needing to worry about all of these details. Even if it isn't used directly by some users, they could look at it to get a better idea of where to start and what flags to keep in mind. I'll file that in a separate jira.
