Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NOPS-1034 added threshold handling for asPercent and divideSeries #165

Merged
merged 2 commits into from
Nov 3, 2021

Conversation

AniaMakes
Copy link
Contributor

Companion to #164

Graphite datapoints are values stored with their timestamp. But, if there are no events during that timestamp, the value will be null and not 0, ie denoting explicit lack of a value. (https://graphite.readthedocs.io/en/latest/terminology.html#graphite-terminology). So in some cases null is an expected - and even desirable value. For example if we're checking the error.count then 0 is ideal as that means no errors are occurring.

Things get a little bit interesting when we look at asPercent and divideSeries. Those two graphite functions are used in the same way - we take a sum of errors in _time_ and a sum of successes in _time_ and we either express the result as a percentage or division result. I think the desired metrics is the ratio of errors to successes; I'm not quite sure either asPercent or divideSeries reflects that, but it's good enough to be used as a valid metric. Anyhow, what this means is practice is most of the time sum of successes = _positive integer_ and sum of errors = null (ie 0). But zero divided by positive integer is still zero (for divideSeries), and zero times 100, then divided by positive integer is also zero (for asPercent).

What this PR does

  • it adds custom handling for asPercent and divideSeries
  • simplifies test headers for sumSeries because I got very confused

What this PR does not do
My initial though would be - once we checked that the result of asPercent or divideSeries was null, we'd run a fetch against the second metric of the check and fail if that's null too. I totally forgot you can't do await inside a map. As such we have two options: 1️⃣ merge this PR and call it done, given that's an improvement on what was there before; and 2️⃣ if we are really concerned about the possibility of Graphite being the weak point, we should go through the repos that are using those checks and add a n-health graphiteWorking check.

When reviewing this PR please state your preference for 1️⃣ vs 2️⃣ vs any other thought you might have on this matter.

@AniaMakes AniaMakes requested a review from a team as a code owner October 27, 2021 12:37
if(result.target && asPercentRegex.test(result.target) || result.target && divideSeriesRegex.test(result.target)){
const fetchCountPerTimeUnit = result.datapoints.map(item => Number(item[0]));
const isFailing = this.direction === 'above' ?
Number(fetchCountPerTimeUnit[0]) > this.threshold :
Copy link
Contributor

@serena97 serena97 Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: I think I need some more context here in understanding why we're only using the first element of fetchCountPerTimeUnit to determine whether it is failing. Is it because we can be sure that there is only 1 datapoint in the array for divideSeries or asPercent, and that datapoint is the result of the query? From what I understand, if there is more than 1 datapoint in the array, the api does not return the result of the query so the datapoints look like

"datapoints": [[1.0, 1635125460], [null, 1635129060], ...]

so I would have thought that we would need to do the calculations ourselves here, similar to how summarize(sumSeries) is done

Copy link
Contributor

@serena97 serena97 Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I just saw your tests and it answered my question - it is indeed 1 item in the datapoints array

datapoints: [[ 8, 1635125640]]

suggestion: as there's only 1 item in the datapoints array, perhaps we can just change line 54 to
const ratioOfErrorsToSuccesses = Number(result.datapoints[0][0])
to make it more readable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a slightly different approach. All the health checks I've seen have only one datapoint, but you are correct that this may not always be the case - I added a log to let us know if that's not the case (so we can iterate as needed).

});
});

it('Should be unhealthy if below threshold', function (done) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: would it make sense for it to be healthy (instead of unhealthy) below the threshold? In my mind, if we get the % of errors against the successes, a low value suggests that there's a low failure rate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests checks that the direction: 'below' works. I doubt the actual functionality would be used because this functionality is generally used for rater of errors to successes. But sometimes you may want to track ratios of other events.

@serena97
Copy link
Contributor

Hey Ania, this looks good to me! I think we should merge this PR in because it helps identify the ratio of errors to successes and therefore is an improvement on what was there before. For option 2, I'd like to understand why Graphite is considered a weak point

@AniaMakes AniaMakes merged commit e8f9c1d into main Nov 3, 2021
@AniaMakes AniaMakes deleted the divide-and-asPercent branch November 3, 2021 10:18
JSRedondo pushed a commit that referenced this pull request Jun 20, 2022
NOPS-1034 added threshold handling for asPercent and divideSeries
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants