Willie Wheeler edited this page Jun 19, 2018 · 4 revisions

Adaptive Alerting wiki

Welcome to the Adaptive Alerting wiki. Please use the menu in the sidebar to have a look around.

Goals

We're creating Adaptive Alerting (AA) with several goals in mind:

  • The overarching goal is to reduce mean time to detect (MTTD) for production incidents. This is the average time between the onset of a production incident and somebody knowing about it, where "somebody" could be a person or an automated response system. There are different ways of thinking about what it means to "know about" an incident, but for us an alert that gets lost in a flood of alerts doesn't count, even if there's some sense in which the monitoring system "knew about" the incident. (AA can monitor things besides production incidents too. For example, it can monitor signals that predict production incidents. But fundamentally we're trying to keep the site up.)
  • To support this, we need to monitor as many signals of health as possible. Otherwise we miss problems, and MTTD suffers. The signals or metrics in question can represent business-, application- or system-level concerns. Our working assumption is that in a large business there are many thousands if not millions of such concerns, and so AA needs to scale accordingly.
  • For scalability, we must aggressively limit the number of false positives (i.e., spurious alerts). These draw attention away from the true positives and undermine the effectiveness of the monitoring system.
  • Also for scalability, we must automate model selection and tuning. Typical users won't know the difference between (say) an anomaly detector based on an exponentially weighted moving average (EWMA) and one based on a long short-term memory (LSTM) network, much less how to tune them. Multiply that by thousands or millions of metrics and it's clear that we have to automate this.
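
To give a feel for the tuning burden mentioned in the last goal, here is a minimal sketch of what an EWMA-based anomaly detector might look like. It is an illustration only, not AA's implementation: the class name `EwmaDetector` and the parameters `alpha`, `threshold_sigmas` and `warmup` are hypothetical names chosen for this example. Even this simple detector has three knobs that somebody would otherwise have to set per metric.

```python
import math


class EwmaDetector:
    """Hypothetical EWMA-based anomaly detector (illustration only, not AA code)."""

    def __init__(self, alpha=0.2, threshold_sigmas=3.0, warmup=5):
        self.alpha = alpha                        # smoothing factor, 0 < alpha <= 1
        self.threshold_sigmas = threshold_sigmas  # width of the normal band, in std devs
        self.warmup = warmup                      # observations to see before flagging
        self.mean = None                          # running EWMA of the metric
        self.variance = 0.0                       # running exponentially weighted variance
        self.count = 0                            # observations seen so far

    def observe(self, value):
        """Classify value against the current estimate, then update the estimate."""
        self.count += 1
        if self.mean is None:
            # The first observation only seeds the mean.
            self.mean = value
            return False

        diff = value - self.mean
        std = math.sqrt(self.variance)
        anomalous = (
            self.count > self.warmup
            and abs(diff) > self.threshold_sigmas * std
        )

        # Incremental updates for the exponentially weighted mean and variance.
        incr = self.alpha * diff
        self.mean += incr
        self.variance = (1.0 - self.alpha) * (self.variance + diff * incr)
        return anomalous


if __name__ == "__main__":
    detector = EwmaDetector()
    series = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
    print([detector.observe(v) for v in series])
```

With the default settings assumed here, only the final value (50) in the sample series is flagged. Changing `alpha` or `threshold_sigmas` shifts that behavior, which is exactly the kind of decision the goal above says should be automated.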