# Replace Outlier Detection by Simple Statistics with ECOD
A new Python-based, simple, parameter-free, and interpretable anomaly detection method

A common heuristic for quickly identifying outliers is the “three sigma” rule. This simple technique classifies any point located more than three standard deviations from the mean as an outlier. The “1.5 IQR” rule is a variant that relies on quartiles instead, which makes it more robust to extreme values.
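Both rules fit in a few lines of NumPy. The sketch below is only an illustration: the function names `three_sigma_outliers` and `iqr_outliers` and the sample data are made up for this example.

```python
import numpy as np

def three_sigma_outliers(x):
    """Flag points more than three standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > 3

def iqr_outliers(x, k=1.5):
    """Flag points more than k * IQR below the first or above the third quartile."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Made-up data: 50 well-behaved values plus one extreme value at the end
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=0.0, scale=1.0, size=50), 15.0)

sigma_flags = three_sigma_outliers(values)
iqr_flags = iqr_outliers(values)
print("three sigma flags at indices:", np.flatnonzero(sigma_flags))
print("1.5 IQR flags at indices:", np.flatnonzero(iqr_flags))
```

Note that the extreme value itself inflates the mean and standard deviation used by the three sigma rule, whereas the quartiles barely move.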

However, this common approach only uses a limited amount of information about the data: the mean and standard deviation.

A new and better alternative is ECOD, an abbreviation of “empirical cumulative distribution functions for outlier detection”. The paper was published in 2021.

ECOD is implemented in the PyOD Python package.

It has several key features that make it stand out from competing algorithms:

* No hyperparameters! This is important because it is difficult to tune hyperparameters for outlier detection when the true labels are rare, unknown, or difficult to obtain.
* Fast and computationally efficient. The time complexity scales linearly with dataset size and number of dimensions.
* Easy to understand and interpretable.

## How to use ECOD in Python


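```python
import numpy as np
from pyod.utils.data import generate_data

# Synthetic data: 200 training rows, 100 test rows, 5 features, 10% outliers.
# behaviour="new" makes the return order explicit (X_train, X_test, y_train, y_test);
# the older default order (X_train, y_train, X_test, y_test) is deprecated.
X_train, X_test, y_train, y_test = generate_data(
    n_train=200,
    n_test=100,
    n_features=5,
    contamination=0.1,
    random_state=3,
    behaviour="new",
)

# Multiply every value by uniform random noise in [0, 1).
# The noise is not seeded, so results differ slightly between runs.
X_train = X_train * np.random.uniform(0, 1, size=X_train.shape)
X_test = X_test * np.random.uniform(0, 1, size=X_test.shape)
```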


Here, I use the `generate_data` function from PyOD to generate a synthetic dataset with 200 training samples and 100 test samples. The normal samples are generated by a multivariate Gaussian distribution; the outlier samples are generated using a uniform distribution.

Both train and test datasets have 5 features and 10% of rows are labeled as anomalies. I add a bit of random noise to the data to make it slightly harder to perfectly separate normal and outlier points.

```python
from pyod.models.ecod import ECOD
from pyod.utils.utility import precision_n_scores
from sklearn.metrics import roc_auc_score

clf_name = 'ECOD'
clf = ECOD()
clf.fit(X_train)  # unsupervised: the labels are not used for fitting

# Anomaly score for each test row (higher means more anomalous)
test_scores = clf.decision_function(X_test)

# Evaluate against the known labels of the synthetic data
roc = round(roc_auc_score(y_test, test_scores), ndigits=4)
prn = round(precision_n_scores(y_test, test_scores), ndigits=4)

print(f'{clf_name} ROC:{roc}, precision @ rank n:{prn}')
#>> ECOD ROC:1.0, precision @ rank n:1.0
```

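Another run printed `ECOD ROC:0.9967, precision @ rank n:0.9`; the exact values vary between runs because the noise added to the data is not seeded.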


As you can see, ECOD distinguishes the generated outliers from the normal points perfectly or very nearly so, depending on the run. If you look at the distribution of anomaly scores below, the two groups are almost completely separable. In practice, however, choosing an appropriate threshold to identify the outliers is not so straightforward without labels.

In PyOD, a fitted outlier detector has two key methods: `decision_function` and `predict`.

* `decision_function` returns an anomaly score for each row.
* `predict` returns an array of 0s and 1s, indicating whether each row is predicted to be normal (0) or an outlier (1). It simply applies a threshold to the anomaly score returned by `decision_function`. The threshold is calibrated automatically from the `contamination` parameter set when initializing the detector, e.g. `clf = ECOD(contamination=0.1)`. The contamination is the expected proportion of outliers in the training data.
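The relationship between scores, contamination, and labels can be sketched in plain NumPy. This is an illustration rather than PyOD's code: it assumes the threshold is the (1 − contamination) percentile of the training scores, and the helper names `fit_threshold` and `labels_from_scores` and the scores themselves are made up.

```python
import numpy as np

def fit_threshold(train_scores, contamination=0.1):
    """Score cutoff above which roughly `contamination` of training rows fall."""
    return np.percentile(train_scores, 100 * (1 - contamination))

def labels_from_scores(scores, threshold):
    """Return 1 for rows scoring above the threshold (outliers), else 0."""
    return (np.asarray(scores) > threshold).astype(int)

# Made-up anomaly scores standing in for the output of decision_function
rng = np.random.default_rng(42)
train_scores = rng.exponential(scale=1.0, size=200)

threshold = fit_threshold(train_scores, contamination=0.1)
train_labels = labels_from_scores(train_scores, threshold)
print(f"threshold={threshold:.3f}, flagged fraction={train_labels.mean():.2f}")
```

The point of the sketch is that the contamination value does not change the scores at all; it only moves the cutoff, so a poorly chosen value changes how many rows are flagged but not their ranking.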

**Some practical notes**

ECOD works best on tabular data. It does not consider the order of data and **isn’t well-suited to time series**.

Data does not require rescaling prior to fitting. ECOD fits a separate univariate empirical cumulative distribution function (ECDF) to each variable. Outlier scores are built from negative log tail probabilities of the original values, so the scores for different variables are on the same scale regardless of the units of each variable.
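The per-variable idea can be sketched in NumPy. This is a simplified illustration, not PyOD's implementation: it assumes the score is the negative log of the left-tail and right-tail ECDF probabilities summed over features, and it leaves out the skewness-based tail selection that the full method uses. The names `ecdf` and `simple_ecod_scores` are hypothetical.

```python
import numpy as np

def ecdf(x):
    """Empirical CDF of a 1-D array, evaluated at each of its own points."""
    x = np.asarray(x, dtype=float)
    # Fraction of samples less than or equal to each value
    return np.searchsorted(np.sort(x), x, side="right") / len(x)

def simple_ecod_scores(X):
    """Simplified ECOD-style score: larger means more extreme."""
    X = np.asarray(X, dtype=float)
    # Negative log tail probabilities, computed independently per feature
    left = -np.log(np.apply_along_axis(ecdf, 0, X))
    right = -np.log(np.apply_along_axis(ecdf, 0, -X))
    # Sum over features, then keep the larger of the two tail totals per row
    return np.maximum(left.sum(axis=1), right.sum(axis=1))

# Made-up data: one extreme row (index 0) followed by 200 Gaussian rows
rng = np.random.default_rng(0)
X = np.vstack([[8.0, 8.0, 8.0], rng.normal(size=(200, 3))])

scores = simple_ecod_scores(X)
# Rescale each column by a different positive factor
rescaled_scores = simple_ecod_scores(X * np.array([1.0, 1000.0, 0.001]))

print("highest-scoring row:", int(np.argmax(scores)))
print("scores unchanged by rescaling:", np.allclose(scores, rescaled_scores))
```

Because the ECDF depends only on the rank of each value within its column, multiplying a column by a positive constant leaves the scores unchanged, which is why rescaling is unnecessary.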

**Cool feature: Anomaly Explanation**

ECOD can explain which features contributed most to the outlier score. This is especially helpful when there are many features in your dataset or you want to tell a human reviewer why the algorithm selected a particular row as an outlier.

The two Dimensional Outlier Graphs above plot, as a blue line, the feature-level outlier scores for two true outliers (rows) that ECOD detected. The x-axis indicates the feature and the y-axis indicates the outlier score for that feature alone. The 90th and 99th percentile outlier scores are also plotted for comparison.

In the left plot (row #182), the outlier score for dimension 2 exceeds the 99th percentile, and the scores for dimensions 1, 3, and 4 also exceed the 90th percentile. All of these variables contribute to the outlier classification.

In the right plot (row #184), the outlier score for dimension 1 exceeds the 90th percentile and is the primary reason the row is an outlier.
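The same reading of the graphs can be done programmatically. The sketch below assumes you already have a matrix of per-feature outlier scores with one row per sample; the matrix here is filled with made-up numbers, and `explain_row` is a hypothetical helper, not a PyOD function. Feature indices are zero-based.

```python
import numpy as np

def explain_row(dim_scores, row, percentiles=(90, 99)):
    """For each percentile, list the features where `row` exceeds the cutoff."""
    dim_scores = np.asarray(dim_scores, dtype=float)
    # One cutoff per feature and percentile: shape (len(percentiles), n_features)
    cutoffs = np.percentile(dim_scores, percentiles, axis=0)
    return {
        p: np.flatnonzero(dim_scores[row] > cutoffs[i]).tolist()
        for i, p in enumerate(percentiles)
    }

# Made-up per-feature scores for 200 rows and 5 features
rng = np.random.default_rng(1)
dim_scores = rng.exponential(scale=1.0, size=(200, 5))
# Make row 7 extreme in feature 2 only (placeholder values)
dim_scores[7] = [0.1, 0.1, 50.0, 0.1, 0.1]

explanation = explain_row(dim_scores, row=7)
print(explanation)
```

A reviewer can then be told which features pushed the row past each cutoff. As far as I know, PyOD's fitted ECOD detector also offers an `explain_outlier` method that draws this kind of graph directly.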