HangtingYe/UADB

Official code and data repository for [UADB: Unsupervised Anomaly Detection Booster]. Please star, watch, and fork UADB to follow active updates!

What is UADB?

UADB is a booster for unsupervised anomaly detection (UAD) on tabular tasks. UADB is not a universal winner on every tabular task; rather, it is a model-agnostic framework that can generally enhance any UAD model across different types of tabular datasets in a unified way.

How to train?

Prepare (create the Results directory first)

  • mkdir Results

Select the tabular dataset and the source UAD model to be enhanced

  • modify config.py

Run UADB

  • python main.py

Mainstream Unsupervised Anomaly Detection (UAD) Models.

Isolation Forest (IForest) paper isolates observations by randomly selecting a feature and a split value;

Histogram-based outlier detection (HBOS) paper assumes feature independence and calculates the degree of outlyingness by building histograms;

Local Outlier Factor (LOF) paper measures the local deviation of the density of a sample with respect to its neighbors;

K-Nearest Neighbors (KNN) paper views an instance's distance to its kth nearest neighbor as the outlying score;

Principal Component Analysis (PCA) paper is a linear dimensionality reduction technique that uses the Singular Value Decomposition of the data to project it onto a lower-dimensional space. In anomaly detection, it projects the data onto the lower-dimensional space and then reconstructs it; the reconstruction errors are used as the anomaly scores;

One-class SVM (OCSVM) paper maximizes the margin between the abnormal and the normal samples, and uses the hyperplane that determines the margin for decision;

Clustering Based Local Outlier Factor (CBLOF) paper groups the samples into small clusters and large clusters and then uses the distances among clusters as anomaly scores;

Connectivity-Based Outlier Factor (COF) paper uses, as the outlier score of an observation, the ratio of the average chaining distance of the data point to the mean of the average chaining distances of its k nearest neighbors;

Subspace Outlier Detection (SOD) paper detects outliers in varying subspaces of a high-dimensional feature space;

Empirical-Cumulative-distribution-based Outlier Detection (ECOD) paper is a parameter-free, highly interpretable outlier detection algorithm based on empirical CDF functions;

Gaussian Mixture Models (GMM) paper fits k Gaussians to the data. Then, for each data point, it calculates the probabilities of belonging to each of the clusters, where lower probabilities indicate higher anomaly scores;

Lightweight on-line detector of anomalies (LODA) paper is an ensemble detector and is particularly useful in domains where a large number of samples need to be processed in real-time or in domains where the data stream is subject to concept drift and the detector needs to be updated online;

Copula Based Outlier Detector (COPOD) paper is a parameter-free, highly interpretable outlier detection algorithm based on empirical copula models;

Deep Support Vector Data Description (DeepSVDD) paper trains a neural network while minimizing the volume of a hypersphere that encloses the network representations of the data. The distance of the transformed embedding to the hypersphere's center is used to calculate the anomaly score.
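To make two of the scoring rules above concrete, here is a minimal NumPy sketch of the KNN score (distance to the k-th nearest neighbor) and the PCA score (reconstruction error after projection). It is only an illustration under simple assumptions, not the PyOD implementation that UADB uses; the function names, the toy data, and the choices `k=5` and `n_components=1` are hypothetical.

```python
import numpy as np


def knn_scores(X, k=5):
    """Anomaly score = Euclidean distance to the k-th nearest neighbor."""
    # Pairwise distance matrix of shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    return np.sort(dists, axis=1)[:, k]


def pca_reconstruction_scores(X, n_components=1):
    """Anomaly score = squared error after projecting onto the top components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Project onto the retained directions, then map back to the input space.
    reconstructed = centered @ components.T @ components
    return ((centered - reconstructed) ** 2).sum(axis=1)


# Toy data (an assumption for illustration): 200 inliers close to the line
# y = x, plus one outlier at (3, -3), stored as the last row (index 200).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
inliers = np.column_stack([t, t + 0.1 * rng.normal(size=200)])
X = np.vstack([inliers, [[3.0, -3.0]]])

knn = knn_scores(X, k=5)
pca_err = pca_reconstruction_scores(X, n_components=1)

print("KNN top-scoring index:", int(np.argmax(knn)))
print("PCA top-scoring index:", int(np.argmax(pca_err)))
```

Both scores follow the convention used throughout this README: a higher value means a more anomalous sample. The brute-force distance matrix is quadratic in the number of samples, so this sketch is only suitable for small datasets.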

Parameters description of source UAD models.

For all source UAD models, we use the default parameters from their original papers (which have been tuned to achieve the best performance). Please refer to PyOD for more information. The following code shows how the UAD models are imported and how their anomaly scores are computed. Please see the table below the code for the complete list of source UAD models included in UADB and links to their parameter settings.

from pyod.models.iforest import IForest
from pyod.models.hbos import HBOS
from pyod.models.pca import PCA
from pyod.models.ocsvm import OCSVM
from pyod.models.lof import LOF
from pyod.models.cblof import CBLOF
from pyod.models.cof import COF
from pyod.models.knn import KNN
from pyod.models.sod import SOD
from pyod.models.ecod import ECOD
from pyod.models.deep_svdd import DeepSVDD
from pyod.models.loda import LODA
from pyod.models.copod import COPOD
from pyod.models.gmm import GMM
from sklearn.preprocessing import MinMaxScaler

def get_init_labels(self):
    # Map each config name to a PyOD detector built with its default parameters.
    pseudo_models = {
        'pca': PCA(), 'iforest': IForest(), 'hbos': HBOS(), 'ocsvm': OCSVM(),
        'lof': LOF(), 'cblof': CBLOF(), 'cof': COF(), 'knn': KNN(),
        'sod': SOD(), 'ecod': ECOD(), 'deep_svdd': DeepSVDD(), 'loda': LODA(),
        'copod': COPOD(), 'gmm': GMM(),
    }
    # Pick the source UAD model named in the config and fit it on the inputs.
    model = pseudo_models[self.config.pseudo_model]
    model.fit(self.inputs)
    # Score the same inputs, then rescale the scores to [0, 1].
    score = model.decision_function(self.inputs)
    score = MinMaxScaler().fit_transform(score.reshape(-1, 1))
    return score

Model Source
IForest link
HBOS link
LOF link
KNN link
PCA link
OCSVM link
CBLOF link
COF link
SOD link
ECOD link
GMM link
LODA link
COPOD link
DeepSVDD link

Runtime of iterative training with 10 iterations on 84 tabular datasets.

For the default UADB setup (3-layer MLP, hidden dimension = 128, epochs = 10, batch size = 256, learning rate = 0.001, training iterations = 10), the average runtime on 84 tabular datasets is 49 seconds, with a minimum of 32 seconds and a maximum of 65 seconds (evaluated on an NVIDIA Tesla V100 GPU with 16 GiB RAM).

Dataset Runtime (seconds)
1_abalone 45.97332
2_ALOI 62.09925
3_annthyroid 55.46907
4_Arrhythmia 39.06063
5_breastw 36.12518
6_cardio 33.07165
7_Cardiotocography 33.20133
9_concrete 31.79388
10_cover 64.64015
11_fault 39.39834
12_glass 37.71293
13_HeartDisease 37.38352
14_Hepatitis 38.01874
15_http 61.8857
16_imgseg 40.50281
17_InternetAds 45.73133
18_Ionosphere 37.77196
19_landsat 53.47792
20_letter 40.62032
21_Lymphography 38.29522
23_mammography 64.00619
24_mnist 59.77169
25_musk 45.249
26_optdigits 48.52583
27_PageBlocks 48.49409
28_Parkinson 37.96288
29_pendigits 54.03328
30_Pima 38.34605
31_satellite 53.66807
32_satimage-2 50.83709
33_shuttle 61.62736
34_skin 63.52546
35_smtp 61.39091
36_SpamBase 45.18679
37_speech 47.31526
38_Stamps 37.86243
39_thyroid 46.09605
40_vertebral 36.9513
41_vowels 38.88513
42_Waveform 45.00404
43_WBC 38.99096
44_WDBC 35.97541
45_Wilt 46.98276
46_wine 37.96173
47_WPBC 37.3959
48_yeast 39.1033
49_CIFAR10_0 52.26063
49_CIFAR10_1 52.14188
49_CIFAR10_2 45.87855
49_CIFAR10_3 46.25881
49_CIFAR10_4 46.14659
49_CIFAR10_5 52.32802
49_CIFAR10_6 51.16338
49_CIFAR10_7 53.35524
49_CIFAR10_8 53.4575
49_CIFAR10_9 50.95251
50_FashionMNIST_0 54.4895
50_FashionMNIST_1 53.37795
50_FashionMNIST_2 45.93535
50_FashionMNIST_3 47.02001
50_FashionMNIST_4 45.6286
50_FashionMNIST_5 46.80384
50_FashionMNIST_6 43.90277
50_FashionMNIST_7 45.52983
50_FashionMNIST_8 49.52233
50_FashionMNIST_9 49.5379
51_SVHN_0 53.37439
51_SVHN_1 55.47349
51_SVHN_2 63.5158
51_SVHN_3 56.23896
51_SVHN_4 56.96233
51_SVHN_5 51.96182
51_SVHN_6 51.20783
51_SVHN_7 64.19061
51_SVHN_8 52.99753
51_SVHN_9 63.43625
52_agnews_0 58.6838
52_agnews_1 56.56641
52_agnews_2 56.18832
52_agnews_3 63.5042
53_amazon 59.95886
54_imdb 64.06704
55_yelp 58.62

Surprising effects on source UAD's decision boundaries on synthetic datasets.

[Figure: effects of UADB on the decision boundaries of source UAD models on synthetic datasets]
