In certain circumstances, we see that the avg aggregation consistently shows an exponential rise from the past to the present. Here's a graph that's trying to show 3xx responses per second per instance.
It looks like we're in a crisis. The number is rapidly climbing! But this graph always looks like this. What's gone wrong? This happens when a certain set of criteria hold:
We're using a rate calculation.
We're downsampling using the "none" fill option.
Over the period we're looking at, tag combinations come and go, e.g. we're tagging by instance IP address and over time new IP addresses appear and old ones disappear.
Our aggregation function is avg or count.
Here's what happens with count instead of avg.
This also looks fishy. We start with 754 time series 7 days ago and fall to 21 time series just now. Here's the raw time series without the rate calculation or the aggregation across hosts:
This is a lot to take in! We've taken off the rate calculation and removed the aggregation across hosts so we can see each separately. That's our 754 separate time series, each counting up separately. Over the course of a week, each only exists for a little while before vanishing, although some come back again later as the host is re-used. Here's one more graph:
We've gone back to count our series again, but this time without the rate calculation. This looks much more like we expect - at any given time we have somewhere between 20 and 60 active time series. Over time this number doesn't change wildly because as we add new instances we remove old instances. So what is happening when we add the rate calculation? I think it's the following:
When we aggregate series together before some of them have started, the missing series have to be filled in somehow. With "none" for downsampling, each time series feeding into the aggregation has its own start and end time. In AggregationIterator, the current timestamp on each series is used to determine whether it is currently "active": before we've reached its first point and after we've passed its last, the timestamp is set to 0, and the series is skipped over in hasNextValue whenever its timestamp is 0.
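As a rough model of that skipping logic (illustrative Python with invented names, not OpenTSDB's actual Java):

```python
# Simplified model of the "active" test described above. Each series carries
# a current timestamp; 0 means "before its first point" or "past its last
# point", and such series are skipped when looking for a value to aggregate.

def has_next_value(current_timestamps):
    # Only series with a non-zero current timestamp can still contribute.
    return any(ts != 0 for ts in current_timestamps)

# Three series mid-query: one not started yet (0), one active at
# t=1570000000, one already finished (0).
assert has_next_value([0, 1570000000, 0])  # the middle series still has data
assert not has_next_value([0, 0, 0])       # everything inactive: nothing left
```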
However, when rate is enabled, AggregationIterator deliberately skips over the first value in each time series during initialisation. In so doing, I believe it causes the timestamp to become non-zero and thus all time series in the aggregation begin contributing from the start, potentially long before we have actually advanced to their start time. This is why the average is so low in the past and climbs towards the present! It's getting biased towards zero by hundreds of inactive time series.
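To make the bias concrete, here's a small simulation (plain Python; it models the behaviour described above with invented names, not the actual Java implementation). Each series emits a constant rate of 1.0 while it's alive; when prematurely-activated series are counted as zero from the start of the query, the average starts far below the true rate and climbs toward the present:

```python
# Illustrative simulation of the described bug: series treated as "active"
# from the start of the query contribute zeros before their real start,
# dragging the average down in the past. A model only, not OpenTSDB code.

def average_per_step(n_steps, series_windows, active_from_start):
    """series_windows: list of (start, end) half-open intervals; each
    series emits a constant rate of 1.0 while alive."""
    averages = []
    for t in range(n_steps):
        values = []
        for (start, end) in series_windows:
            if start <= t < end:
                values.append(1.0)        # genuinely alive: rate of 1.0
            elif active_from_start and t < end:
                values.append(0.0)        # bug: counted before it starts
        averages.append(sum(values) / len(values) if values else 0.0)
    return averages

# 90 short-lived series, each alive for 10 of 100 steps, staggered so
# roughly 10 are alive at any given time (like hosts being replaced).
windows = [(i, i + 10) for i in range(0, 90)]

correct = average_per_step(100, windows, active_from_start=False)
buggy = average_per_step(100, windows, active_from_start=True)

print(correct[5], correct[50])  # flat at the true per-series rate
print(buggy[5], buggy[50])      # far below 1.0 early on, climbing later
```

With `active_from_start=False` the average sits at the true rate of 1.0 wherever any series is alive; with it enabled, early buckets are dragged down by dozens of not-yet-started series, reproducing the ever-climbing shape in the graphs.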
What do you think, does this look like a bug? I wonder if AggregationIterator should reset the timestamp back to zero after removing the first spurious value from each time series.
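Here's a sketch of what that fix might look like, again as an illustrative Python model with invented names rather than the real AggregationIterator code:

```python
# Illustrative sketch of the proposed fix (invented names, not the actual
# AggregationIterator). A cursor is "inactive" while its timestamp == 0.

class Cursor:
    def __init__(self, points):
        self.points = list(points)  # sorted (timestamp, value) pairs
        self.index = -1
        self.timestamp = 0          # 0 == not currently active

    def advance(self):
        self.index += 1
        if self.index < len(self.points):
            self.timestamp = self.points[self.index][0]
            return self.points[self.index]
        self.timestamp = 0          # past the last point: inactive again
        return None

def init_for_rate(cursor):
    # Rate needs a previous point to diff against, so initialisation
    # consumes the first raw value of each series...
    cursor.advance()
    # ...and the proposed fix: reset the timestamp to 0 afterwards, so the
    # series stays "inactive" until the aggregation actually reaches it.
    cursor.timestamp = 0

# Without the reset, priming leaves the cursor looking active immediately:
primed = Cursor([(100, 5.0), (110, 7.0)])
primed.advance()
assert primed.timestamp == 100

# With the reset, it stays inactive until genuinely reached:
fixed = Cursor([(100, 5.0), (110, 7.0)])
init_for_rate(fixed)
assert fixed.timestamp == 0
```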
This one is tricky and it's usually related to delayed writes due to assigning UIDs. But that's also dependent on having a write path that can delay a data point for UID assignment and then write it back later on. That also typically only affects tip data, i.e. the most recent few minutes of data.
In this case the query is over multiple days though, and since the overall number of time series is dropping, it's computing the average over fewer values. You can try to fill with zero to bring the value down, but that's an artificial drop.
But the best solution would be to look at the sum of 3xxs across hosts; don't look at the avg, since that will give you a false sense of what's happening, as in this case.
Hi! Thanks for looking at this. I'm not sure I understand how your answer relates here. I don't think this one has anything to do with delayed writes? I am moderately sure I've described the mechanism that's causing this, and it comes down to how AggregationIterator decides, for a given point in time, which time series are active and which are not. In a non-rate calculation, a time series isn't active until the timestamp of its first datapoint in the query range, and ceases to be active after its last datapoint in the query range. In a rate calculation, every time series is active from the start of the query range, even if it doesn't have any data until much later. They still cease to be active after their last datapoint.
To clarify further, when I say a time series is inactive, I mean that it is skipped over and does not emit any values from the AggregationIterator. This skipping happens in hasNextValue(true) for time series that have a zero timestamp. This can be because they haven't been reached yet or because they were zeroed out in next() when we reached the end of them. But specifically for a rate calculation, the constructor advances all of the iterators right away, setting all of them to non-zero timestamps. That's what seems like a problem to me. It looks like a bug. We can work around it, and maybe you think it's not worth fixing, but do you understand why it seems wrong to me?
annettejanewilson commented Dec 17, 2019