
Support of cloudwatch StatisticSet #141

Closed
pm47 opened this issue Mar 18, 2013 · 4 comments

pm47 commented Mar 18, 2013

Hello,

I am currently looking at servo to monitor the response time of a REST service that is potentially called many times a second. I'd like to send data to CloudWatch, and obviously that requires some aggregation somewhere.

The CloudWatch documentation (see Aggregation) covers almost the same use case and recommends using a StatisticSet.

It makes sense because it allows you both to aggregate data early in the process and still not lose valuable information such as min, max, average, and the total number of measurement points.

I believe this is an extremely common use case, so can I ask how you handle this at Netflix? Do you create separate metrics for each of min/max/etc.?

Cheers

Pierre

@brharrington
Contributor

Yes, we treat them as separate metrics with a tag (dimension in Amazon's terminology) indicating the statistic. See BasicTimer for an example:

https://github.com/Netflix/servo/blob/master/servo-core/src/main/java/com/netflix/servo/monitor/BasicTimer.java

In Amazon's case there are four fixed statistics: min, max, sum, and count (along with avg, which is just a computed stat, sum / count), and that is essentially the same set used by BasicTimer. We could try to have the cloudwatch observer map this to a statistic set so it is a bit more natural in CloudWatch. The catch is that we don't necessarily enforce that set; for example, StatsTimer offers percentiles, stddev, etc.:

https://github.com/Netflix/servo/blob/master/servo-core/src/main/java/com/netflix/servo/monitor/StatsTimer.java

Since these cannot be a statistic in the CloudWatch sense, we would send them in as a dimension when going to CloudWatch. The other caveat with some of these stats is that they don't aggregate well after being sampled on a particular instance. We typically run clusters of many instances, and the stats in BasicTimer and Amazon CloudWatch can be accurately aggregated from the per-instance samples to get the value across a cluster. A percentile cannot be aggregated in this way, but it can still be useful for detecting outlier instances with less noise than just falling back to max. For our internal use cases we have our own monitoring backend that gets data from instances each minute, so we can look at, say, the 90th percentile across instances in the cluster to find outliers. We only send metrics to CloudWatch for use with auto-scaling policies, and these are typically aggregates across all instances in the auto scaling group; we don't send them with an InstanceId dimension since we can get that info internally.
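To sketch why these four statistics aggregate cleanly across instances while percentiles do not, here is a minimal illustration (the `StatSet` class is hypothetical, not part of servo or the AWS SDK): merging two per-instance summaries yields the exact cluster-wide values, whereas two per-instance percentiles cannot be combined into the cluster percentile.

```java
// Hypothetical per-instance summary, shaped like CloudWatch's StatisticSet.
public class StatSet {
    public final double min, max, sum;
    public final long count;

    public StatSet(double min, double max, double sum, long count) {
        this.min = min; this.max = max; this.sum = sum; this.count = count;
    }

    // Exact cluster-wide aggregate from two per-instance summaries:
    // min of mins, max of maxes, sum of sums, sum of counts.
    public static StatSet merge(StatSet a, StatSet b) {
        return new StatSet(
            Math.min(a.min, b.min),
            Math.max(a.max, b.max),
            a.sum + b.sum,
            a.count + b.count);
    }

    // avg is derived, never stored: sum / count.
    public double avg() {
        return sum / count;
    }
}
```

No such merge exists for a 90th percentile: two instances' p90 values alone don't determine the cluster p90, which is why those stats are only useful for comparing instances against each other.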


pm47 commented Mar 18, 2013

That answers my question, thanks!

pm47 closed this as completed Mar 18, 2013

pm47 commented Mar 28, 2013

On second thought, there is something I am still missing.

Let's take my use case (measuring response time of a particular function with many calls/minute). Let's suppose the function I want to monitor gets called 10 times a second, and let's say I send data to cloudwatch once every minute.

I need aggregation, not because I want to compute the min/max/... response time from application startup to now, but because I collect more data points than I can send to cloudwatch (which is precisely the use case of StatisticSet).

Therefore I don't think mapping a BasicTimer min, max, sum and count to a StatisticSet would make any sense in that respect anyway.

I saw that there is a ResettableMonitor interface, and I suppose I should maybe put an additional step between the BasicTimer and the CloudWatchMetricObserver that periodically resets the BasicTimer, but I'm not sure how to do it properly or whether it even makes sense.

Do you see my point?

Thanks

Pierre

pm47 reopened this Mar 28, 2013
@brharrington
Contributor

I think what you may be missing is:

https://github.com/Netflix/servo/wiki#converting-counter-values-to-rates

You would most likely want to wrap the counter with a CounterToRateMetricTransform so that what is reported is the rate per second rather than the cumulative total. You can see an example of how some of this would typically be configured in the example app:

https://github.com/Netflix/servo/wiki/Example-Application

I'll try to explain the reasoning to hopefully make it clearer why we do this. If you find it helpful, we'll try to fit it back into the main docs, as it might be useful to others in the future.

A basic counter for us is just a monotonically increasing integer. The rate transform caches the previous sample so that when we poll again we can compute a delta between the current sample and the previous sample to get the rate of change. We like this because it is cheap to update, usually just an increment of an AtomicLong, and we can have multiple pollers query it at different frequencies and still get an accurate rate for each poller because we would configure a separate rate transform for each. One downside to this though is that you don't get a value in the wrapped observer until the second poll because we need two samples to compute a delta. This usually isn't much of a problem for us because it is a long lived monitor and the first interval is typically early before load is going to an app.
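The delta logic described above can be sketched in a few lines. This is illustrative only (the class names are made up, not servo's actual API): the counter is a monotonically increasing AtomicLong, and each poller owns its own transform that caches the previous sample and turns deltas into a per-second rate.

```java
import java.util.concurrent.atomic.AtomicLong;

public class RateExample {
    // Cheap update path: the instrumented code just increments.
    static final AtomicLong requests = new AtomicLong();

    static void onRequest() {
        requests.incrementAndGet();
    }

    // One instance per poller, so pollers running at different
    // frequencies each see an accurate rate.
    static class RateTransform {
        private long prevValue = -1;
        private long prevTimeMs = -1;

        // Returns null on the first poll: two samples are needed to
        // compute a delta, which is the caveat mentioned above.
        Double poll(long value, long nowMs) {
            Double rate = null;
            if (prevValue >= 0) {
                rate = (value - prevValue) * 1000.0 / (nowMs - prevTimeMs);
            }
            prevValue = value;
            prevTimeMs = nowMs;
            return rate;
        }
    }
}
```

For example, a poller that sees the counter go from 0 to 600 over a 60-second window would report a rate of 10 per second.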

A resettable monitor is used to indicate to the pollers that the value only makes sense in a given window and needs to be reset to get an accurate view for that interval, for example what is the max value for an interval. The downside is that now we have to be aware of multiple pollers.

To give a concrete example, we might want to log to a local store on the instance every 10 seconds for more fine grained debugging info and log to cloudwatch every minute. If we used a resettable counter and the 10 second poller resets the count, the value reported to cloudwatch will be wrong because it would only include the count since the last time the 10 second poller reset the value. With a monotonic counter and independent rate transforms we can have accurate rates for both the 10 second local file poller and 1 minute cloudwatch poller.
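The pitfall in that example can be shown with a small sketch (an anti-pattern, not servo code): when a resettable counter is shared by a 10-second poller and a 1-minute poller, the fast poller's resets destroy the slow poller's window.

```java
// Anti-pattern: a shared resettable counter polled at two frequencies.
public class ResetPitfall {
    private long count = 0;

    void increment(long n) { count += n; }

    // Any poller calling this wipes the window for every other poller.
    long pollAndReset() {
        long v = count;
        count = 0;
        return v;
    }
}
```

If 100 events arrive in each 10-second window and the fast poller resets every window, the 1-minute poller sees nearly nothing instead of the true total of 600. With a monotonic counter and one rate transform per poller, both readings stay accurate.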

This doesn't work for all statistics, like max, so for those we make them resettable and pick a primary poller that will actually do the reset. So in the example above we would set cloudwatch to be the primary poller and the local file poller would just sample the max and not reset the value. That is the reason for the reset flag on the poll call to MetricPoller:

https://github.com/Netflix/servo/blob/master/servo-core/src/main/java/com/netflix/servo/publish/MetricPoller.java
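The primary-poller idea can be sketched as a resettable max gauge with a reset flag, mirroring the poll-with-reset shape described above (this is an illustration, not servo's implementation): the primary poller passes reset=true, while secondary pollers just sample.

```java
// Illustrative resettable max gauge; only the primary poller resets it.
public class MaxGauge {
    private long max = Long.MIN_VALUE;

    synchronized void record(long value) {
        if (value > max) max = value;
    }

    // reset=false: sample only (secondary pollers).
    // reset=true: sample and start a new window (primary poller).
    synchronized long poll(boolean reset) {
        long v = max;
        if (reset) max = Long.MIN_VALUE;
        return v;
    }
}
```

So the local-file poller would call `poll(false)` every 10 seconds, and the cloudwatch poller would call `poll(true)` each minute to get the max for its window.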

Bringing this back to BasicTimer: it is a composite with 4 monitors:

  • totalTime - monotonic counter of time recorded
  • count - monotonic counter of number of things recorded
  • max - resettable gauge keeping track of the max value since last reset
  • min - resettable gauge keeping track of the min value since last reset

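The shape of that composite can be sketched like this (a simplified illustration of the four monitors listed above, not the real BasicTimer class): the two counters only ever accumulate, while the gauges are reset per window by the primary poller.

```java
// Simplified sketch of BasicTimer's four monitors.
public class TimerSketch {
    long totalTime = 0;        // monotonic counter of time recorded
    long count = 0;            // monotonic counter of recordings
    long max = Long.MIN_VALUE; // resettable gauge
    long min = Long.MAX_VALUE; // resettable gauge

    void record(long duration) {
        totalTime += duration;
        count++;
        if (duration > max) max = duration;
        if (duration < min) min = duration;
    }

    // Called by the primary poller only; the counters are never reset.
    void resetWindow() {
        max = Long.MIN_VALUE;
        min = Long.MAX_VALUE;
    }
}
```

Each poller then derives the per-window average from deltas of totalTime and count, while min and max are read directly.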
One thing that is clear is that we probably need to provide a utility that simplifies setting up a background poller that typically does the "right thing" without users needing to worry about all of these details. Even if it isn't used directly by some users, they could look at it to get a better idea of where to start and what flags to keep in mind. I'll file that in a separate jira.
