
Tackling Alert Fatigue

Accompanying repository for the "Recovering From Alert Fatigue" talk given at Monitorama 2016 [Slides]

Abstract

Systems that generate numerous critical alerts cause alert fatigue, which can lead to service outages and developer burnout. My team at Twitter found itself in this situation. The services had scaled by an order of magnitude in two years and were generating hundreds of alerts per quarter. Over the course of a quarter I led an initiative to decrease the number of alerts, improve the experience of being on call, and increase the reliability of the services. These efforts were highly successful, reducing the number of critical alerts by 50%. In this talk I’ll discuss the process and the alerting best practices we’ve put in place to combat alert fatigue and avoid over-alerting in the future.

References

Observability at Twitter

Related Tweets

Bio

Caitie McCaffrey is a Backend Brat and Distributed Systems Diva at Twitter, where she is the Tech Lead of the Observability Team. Prior to that she spent the majority of her career building large-scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO. Caitie has a degree in Computer Science from Cornell University and has worked on several video games, including Gears of War 2, Gears of War 3, Halo 4, and Halo 5. She maintains a blog at CaitieM.com and frequently discusses technology on Twitter @Caitie.