Address the sensitive error rate check #476

sjparkinson · 2017-10-20T13:54:22Z

I looked at the values for next-preflight during a 10 minute period when the alert was going off.

The check was using a baseline period of 7 days and a sample period of 10 minutes.

Over 7 days, the baseline error rate for next-preflight was 0.00001.

Over the last 10 minutes, the sample error rate was 0.00000571.

For graphiteSpike we take these values and compute the following...

const ok = sample / baseline < threshold;

With a threshold of 0.04, we find that sample / baseline is 0.571, which is not less than 0.04, and so we're not OK.

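I believe using the default threshold of 3 in this case will catch major spikes in errors introduced by a release without being too sensitive. The check would then only fail once the sample error rate reaches 3 times the baseline.

I think the confusion came from treating threshold as a percentage value, when it is actually compared against the ratio of the sample rate to the baseline rate.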
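To make the arithmetic concrete, here is a minimal sketch of the comparison using the figures quoted above. `isOk` is a hypothetical helper written only for illustration and is not the actual graphiteSpike implementation; it assumes the check is nothing more than the single expression shown earlier.

```javascript
// Hypothetical helper mirroring the comparison quoted above.
// This is an illustration, not the real graphiteSpike code.
function isOk(sample, baseline, threshold) {
  return sample / baseline < threshold;
}

const baseline = 0.00001;  // error rate over the 7 day baseline period
const sample = 0.00000571; // error rate over the 10 minute sample period

// Roughly 0.571: the sample rate is a bit over half the baseline rate.
const ratio = sample / baseline;

// Threshold of 0.04: 0.571 is not below 0.04, so the check fails.
const okWithOldThreshold = isOk(sample, baseline, 0.04);

// Default threshold of 3: 0.571 is below 3, so the check passes.
const okWithDefaultThreshold = isOk(sample, baseline, 3);

console.log({ ratio, okWithOldThreshold, okWithDefaultThreshold });
```

Read this way, a threshold of 0.04 would only pass when the sample error rate is under 4% of the baseline rate, which is why the alert fired even though errors were below the baseline.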

I used this dashboard for next-preflight to work out what Graphite was returning: http://grafana.ft.com/dashboard/db/next-n-express-error-rate.

🐿 v2.5.16

coveralls · 2017-10-20T14:08:11Z

Coverage decreased (-0.07%) to 88.401% when pulling b1c5dfb on error-rate-patch into 79b1b63 on master.

geek-caroline

I'm glad you understand the maths properly :)

I'll probably pop by on Monday for a full explanation but I think your comments are pretty clear. Thank you for looking into this properly!

Address the sensitive error rate check.

9c93118


sjparkinson requested review from geek-caroline and GlynnPhillips October 20, 2017 13:54

Update the test.

b1c5dfb


geek-caroline approved these changes Oct 20, 2017


sjparkinson merged commit 9194326 into master Oct 20, 2017

sjparkinson deleted the error-rate-patch branch October 20, 2017 14:37

adambraimbridge mentioned this pull request Jun 4, 2019

Show spike data in error message Financial-Times/n-health#117

Merged
