Anomaly Detection

Anomaly Detection overview

The Anomaly Detection module supports the first phase of a two-phase alert generation process. In the first phase, our goal is to quickly identify candidate anomalies through time series outlier analysis. In the second phase, supported by the Anomaly Validation module, we aim to eliminate false positives by performing a more thorough investigation of the candidate anomalies. Validated anomalies generate alerts.

Anomaly Detection architecture

Metric topic

This is a single topic that receives all incoming metrics.

(For discussion: Different topics for different message formats. There was a discussion around the need to support both Metrics 2.0 and Graphite messages. Need to work through this.)

Metric Router

The Metric Router routes metric points to model-specific Anomaly Detector Managers. The Metric Router uses the Model Service to get mapping information.
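
To make the routing step concrete, here's a minimal sketch of what the lookup-and-forward logic might look like. The ModelService interface, the MetricPoint shape, and the per-type topic naming are illustrative assumptions, not the actual API:

```java
import java.util.List;

// Illustrative sketch of the routing step. ModelService, MetricPoint,
// and the per-type topic naming are assumptions, not the actual API.
public class MetricRouter {

    private final ModelService modelService;

    public MetricRouter(ModelService modelService) {
        this.modelService = modelService;
    }

    // Ask the Model Service which detector types apply to this metric,
    // then forward the point toward each corresponding ADM.
    public void route(MetricPoint point) {
        List<String> detectorTypes = modelService.detectorTypesFor(point.metricKey());
        for (String type : detectorTypes) {
            publish("metrics-" + type, point);  // e.g. "metrics-ewma"
        }
    }

    private void publish(String topic, MetricPoint point) {
        // Kafka producer call elided for brevity.
    }
}

interface ModelService {
    List<String> detectorTypesFor(String metricKey);
}

record MetricPoint(String metricKey, long epochSeconds, double value) {}
```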

Anomaly Detector Managers

An Anomaly Detector Manager (ADM) manages a large number of anomaly detectors of a given type. In general, a group of ADMs works together to process the metrics for a given type. For example, one ADM might handle processing for 10,000 EWMA metrics while a separate ADM handles processing for a different set of 10,000 EWMA metrics.
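
As a sketch, an ADM might keep a map from metric key to detector instance, creating detectors lazily as new metrics in its assigned set arrive. The class and method names below are illustrative assumptions, not the actual API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative sketch of an ADM for one detector type. Names are assumptions.
enum AnomalyLevel { NORMAL, WEAK, STRONG }

interface AnomalyDetector {
    AnomalyLevel classify(double value);
}

public class AnomalyDetectorManager {

    // One detector instance per metric in this manager's assigned set.
    private final Map<String, AnomalyDetector> detectors = new ConcurrentHashMap<>();
    private final Supplier<AnomalyDetector> detectorFactory;

    public AnomalyDetectorManager(Supplier<AnomalyDetector> detectorFactory) {
        this.detectorFactory = detectorFactory;
    }

    public AnomalyLevel classify(String metricKey, double value) {
        // Create the detector lazily the first time we see the metric.
        return detectors
                .computeIfAbsent(metricKey, k -> detectorFactory.get())
                .classify(value);
    }
}
```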

Anomaly Detectors

The anomaly detectors are the heart of the Anomaly Detection system. Their job is to quickly classify each incoming metric point as normal, a weak anomaly or a strong anomaly, and then forward the classified metric point downstream for further analysis.

Classified metric topic

Following anomaly detection, each metric point has one of the following classifications:

  • Normal
  • Weak anomaly
  • Strong anomaly

Anomaly detectors place classified metrics on the classified metric topic, where the Anomaly Validation system picks them up for further processing.
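
To make the message shape concrete, a classified point might look something like the record below before being published to the topic. The field names and topic name are assumptions for illustration:

```java
// Hypothetical shape of a message on the classified metric topic
// (e.g. a "classified-metrics" Kafka topic; names are assumptions).
enum AnomalyLevel { NORMAL, WEAK, STRONG }

record ClassifiedMetricPoint(
        String metricKey,      // identifies the time series
        long epochSeconds,     // observation timestamp
        double observedValue,  // the raw metric value
        double outlierScore,   // score assigned by the metric predictor
        AnomalyLevel level) {} // NORMAL, WEAK or STRONG
```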

Anatomy of an anomaly detector

Anomaly detectors have an internal structure that's worth understanding. Let's take a look.

Metric Predictor

Most anomaly detectors are backed by a time series prediction model. The predictions allow us to assign numerical outlier scores to incoming metrics by looking at the prediction error, or the difference between the observed value and the predicted value. Different prediction models have different ways of assigning outlier scores.

Note that the scores don't immediately translate into anomaly classifications. They are simply numerical scores indicating the extent to which the prediction model considers a given metric point an outlier. These scores become inputs into the classification model described below.
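
For example, one common convention (an assumption here, since different prediction models score outliers differently) is to normalize the prediction error by a running estimate of the standard deviation, yielding a sigma-scaled outlier score:

```java
// Minimal sketch: turn a prediction error into a sigma-scaled outlier score.
// The normalization convention is an assumption; real models vary.
public final class OutlierScorer {

    public static double score(double observed, double predicted, double stdDevEstimate) {
        double error = observed - predicted;      // prediction error
        if (stdDevEstimate == 0.0) {
            return 0.0;                           // avoid division by zero
        }
        return Math.abs(error) / stdDevEstimate;  // distance in sigmas
    }
}
```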

Metric Predictor Perfmon

Many time series prediction models grow "stale" over time. This happens, for example, when the time series has an underlying trend that has changed since the last model build, or when the model has some other characteristic (like seasonality) that has evolved since that build.

Typically such models have a regular rebuild schedule to avoid problems like this. However, we also have a Metric Predictor performance monitor that applies standard measures like RMSE and sMAPE to score the model's own performance. We feed these scores back into the metric topic for visualization and anomaly detection. (Note that we don't keep stacking these perfmon series on top of each other. We do just one layer.) If a perfmon series goes out of whack, we can hook it up to an external remediation workflow that can kick off a model build, or else simply log the issue for future correction.
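
As an example of one such measure, here's a sketch of sMAPE over a window of paired observed/predicted values, using the common definition sMAPE = (100/n) * sum(|F_t - A_t| / ((|A_t| + |F_t|)/2)). The class shape and window handling are assumptions; see the RMSE section below for the other measure mentioned:

```java
// Illustrative sMAPE evaluator over paired observed/predicted arrays.
// sMAPE = (100 / n) * sum(|predicted - observed| / ((|observed| + |predicted|) / 2))
public final class Smape {

    public static double evaluate(double[] observed, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double denom = (Math.abs(observed[i]) + Math.abs(predicted[i])) / 2.0;
            if (denom != 0.0) {
                sum += Math.abs(predicted[i] - observed[i]) / denom;
            }
        }
        return 100.0 * sum / observed.length;
    }
}
```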

Anomaly Classifier

Following the time series prediction, we now have an outlier score. The anomaly classifier's job is to convert that score into a classification: normal, weak anomaly or strong anomaly. For example, this could be a matter of treating anything two sigmas out as a weak anomaly, and three sigmas out as a strong anomaly.

This is usually a judgment call that we tune in response to user feedback, since we generally don't have "ground truth" as to whether a given classification is correct. There may be a way to capture user feedback and incorporate it into classifier training, and that's something we're interested in exploring.
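
A minimal sigma-threshold classifier along those lines might look like the sketch below. The two- and three-sigma defaults are just the example above, and the names are illustrative:

```java
// Sketch of a threshold-based classifier over sigma-scaled outlier scores.
enum AnomalyLevel { NORMAL, WEAK, STRONG }

public class SigmaThresholdClassifier {

    private final double weakSigmas;    // e.g. 2.0
    private final double strongSigmas;  // e.g. 3.0

    public SigmaThresholdClassifier(double weakSigmas, double strongSigmas) {
        this.weakSigmas = weakSigmas;
        this.strongSigmas = strongSigmas;
    }

    public AnomalyLevel classify(double outlierScoreInSigmas) {
        if (outlierScoreInSigmas >= strongSigmas) return AnomalyLevel.STRONG;
        if (outlierScoreInSigmas >= weakSigmas) return AnomalyLevel.WEAK;
        return AnomalyLevel.NORMAL;
    }
}
```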

Anomaly Classifier Perfmon

Currently this is a placeholder component, mostly to communicate the idea that the anomaly classifier in principle can perform either well or poorly. (That is, when there's an alert, there's a fact of the matter whether some underlying process actually changed in a way to produce the outlier, even if we don't know it.) In cases where the ground truth is known, using the F1 score (see also precision/recall) is common.

List of time series prediction models

Constant threshold

[FIXME This isn't a time series prediction model]

Applies user-selected constant thresholds to a given time series. One threshold is for weak anomalies and the other is for strong anomalies. It is similar to Seyren in this respect.

Currently the constant threshold detector is single-tailed, meaning that the detector checks for values that are too low (left-tailed) or too high (right-tailed), but not both. We'll enhance the detector to allow two-tailed checks at some point.
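
As a sketch, a right-tailed version might look like the following (a left-tailed version would flip the comparisons; names are illustrative):

```java
// Sketch of a right-tailed constant threshold detector: values at or above
// strongThreshold are strong anomalies, at or above weakThreshold weak ones.
enum AnomalyLevel { NORMAL, WEAK, STRONG }

public class ConstantThresholdDetector {

    private final double weakThreshold;
    private final double strongThreshold;  // assumed >= weakThreshold

    public ConstantThresholdDetector(double weakThreshold, double strongThreshold) {
        this.weakThreshold = weakThreshold;
        this.strongThreshold = strongThreshold;
    }

    public AnomalyLevel classify(double value) {
        if (value >= strongThreshold) return AnomalyLevel.STRONG;
        if (value >= weakThreshold) return AnomalyLevel.WEAK;
        return AnomalyLevel.NORMAL;
    }
}
```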

Exponentially Weighted Moving Average

The Exponentially Weighted Moving Average (EWMA) model is a weighted sum of the current observation on the one hand and the aggregation of all historical observations on the other. The model has a single parameter alpha in the range [0, 1] that determines the relative weight to put on current vs. historical values. Placing more weight on the current observation makes the model more responsive to changes but more prone to overfitting.

See https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average.
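
In code, the update is a one-liner: s_t = alpha * x_t + (1 - alpha) * s_{t-1}. The sketch below also returns the residual, which a detector can feed into its outlier score; the seeding behavior is an assumption:

```java
// Sketch of the EWMA update: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
public class Ewma {

    private final double alpha;  // weight on the current observation, in [0, 1]
    private double mean;
    private boolean initialized = false;

    public Ewma(double alpha) {
        this.alpha = alpha;
    }

    // Updates the moving average and returns the residual (prediction error).
    public double update(double value) {
        if (!initialized) {
            mean = value;  // seed with the first observation (an assumption)
            initialized = true;
            return 0.0;
        }
        double residual = value - mean;
        mean = alpha * value + (1 - alpha) * mean;
        return residual;
    }

    public double mean() {
        return mean;
    }
}
```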

Probabilistic Exponentially Weighted Moving Average

The PEWMA model is a modified version of the EWMA model. The model estimates the probability of a given observation (assuming normally distributed residuals and a running estimate of the variance) and then downweights unlikely observations. A second parameter beta in the range [0, 1] controls the extent of the downweighting. PEWMA reduces to EWMA when beta is 0.

See https://www.ll.mit.edu/mission/cybersec/publications/publication-files/full_papers/2012_08_05_Carter_IEEESSP_FP.pdf for more information on PEWMA (and EWMA for that matter).
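
Here's a sketch following the update rule in that paper. Note that the paper's alpha is the weight on history (the opposite convention from the EWMA section above), and the probability comes from a standard normal density over the standardized residual:

```java
// Sketch of PEWMA per Carter & Streilein (2012). Here alpha weights history
// (the paper's convention) and beta in [0, 1] controls the downweighting of
// unlikely observations. With beta = 0 this reduces to a standard EWMA.
public class Pewma {

    private final double alpha;
    private final double beta;
    private double s1;  // running estimate of the mean
    private double s2;  // running estimate of the second moment
    private boolean initialized = false;

    public Pewma(double alpha, double beta) {
        this.alpha = alpha;
        this.beta = beta;
    }

    public void update(double x) {
        if (!initialized) {
            s1 = x;
            s2 = x * x;
            initialized = true;
            return;
        }
        double stdDev = stdDev();
        double z = stdDev == 0.0 ? 0.0 : (x - s1) / stdDev;

        // Probability of the observation under a normal residual model.
        double p = Math.exp(-0.5 * z * z) / Math.sqrt(2.0 * Math.PI);

        // Unlikely observations (small p) keep alphaT near alpha, so they
        // get less weight in the update than likely observations do.
        double alphaT = alpha * (1.0 - beta * p);
        s1 = alphaT * s1 + (1.0 - alphaT) * x;
        s2 = alphaT * s2 + (1.0 - alphaT) * x * x;
    }

    public double mean() {
        return s1;
    }

    public double stdDev() {
        return Math.sqrt(Math.max(s2 - s1 * s1, 0.0));
    }
}
```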

List of model evaluators

Root mean square error

Root mean square error (RMSE) takes the mean of the squared errors, and then takes the square root to get the result back on the same scale as the errors. See https://en.wikipedia.org/wiki/Root-mean-square_deviation.
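
A straightforward sketch over paired observed/predicted values:

```java
// RMSE = sqrt((1 / n) * sum((predicted_i - observed_i)^2))
public final class Rmse {

    public static double evaluate(double[] observed, double[] predicted) {
        double sumSquaredError = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double error = predicted[i] - observed[i];
            sumSquaredError += error * error;
        }
        return Math.sqrt(sumSquaredError / observed.length);
    }
}
```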

Future models and evaluators

These are anomaly detection algorithms we're interested in incorporating:

Here are some model evaluators we're interested in: