Unsupervised Model Selection for Time-series Anomaly Detection

Most time-series anomaly detection models don't need labels for training. So why should we need labels to select good models?

TL;DR: We introduce `tsadams` for unsupervised time-series anomaly detection model selection!

Hundreds of models for time-series anomaly detection are available to practitioners, but no method exists to select the best model and its hyperparameters for a given dataset when labels are not available. We construct three classes of surrogate metrics, which we show to be correlated with common supervised anomaly detection accuracy metrics such as the F1 score. The three classes are prediction accuracy, centrality, and performance on injected synthetic anomalies. We show that some of the surrogate metrics are useful for unsupervised model selection but are not sufficient by themselves. We therefore treat combining metrics as a rank aggregation problem and propose a robust rank aggregation approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised aggregation approach is as effective as selecting the best model using collected anomaly labels.
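To make the "performance on injected synthetic anomalies" idea concrete, here is a minimal sketch. It is not the `tsadams` implementation: the spike injection, the moving-average detector, and the helper names (`inject_spikes`, `best_f1`, `moving_average_scores`) are all assumptions chosen for illustration. The point is only that labels for injected anomalies are known by construction, so a supervised metric such as F1 can be computed without human annotation.

```python
import numpy as np

def inject_spikes(series, n_anomalies=5, scale=4.0, seed=0):
    """Return a copy of `series` with spikes added and 0/1 labels marking them."""
    rng = np.random.default_rng(seed)
    corrupted = series.copy()
    labels = np.zeros(len(series), dtype=int)
    idx = rng.choice(len(series), size=n_anomalies, replace=False)
    corrupted[idx] += scale * series.std()
    labels[idx] = 1
    return corrupted, labels

def best_f1(scores, labels):
    """Best F1 over all thresholds applied to the anomaly scores."""
    best = 0.0
    for threshold in np.unique(scores):
        pred = scores >= threshold
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp > 0:
            best = max(best, 2 * tp / (2 * tp + fp + fn))
    return float(best)

def moving_average_scores(series, window=5):
    """Toy detector (hypothetical): distance of each point from a moving average."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(series, kernel, mode="same")
    return np.abs(series - smoothed)

# A clean sinusoid stands in for an unlabeled time-series.
t = np.arange(500)
clean = np.sin(2 * np.pi * t / 50)

# Inject anomalies, score the corrupted series, and evaluate against the injected labels.
corrupted, labels = inject_spikes(clean)
synthetic_f1 = best_f1(moving_average_scores(corrupted), labels)
print(f"F1 on injected anomalies: {synthetic_f1:.2f}")
```

Repeating this for every candidate model yields one ranking of models per type of injected anomaly, which can then be combined with rankings from the other surrogate metrics.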

Figure 1: The Model Selection Workflow. We identify three classes of surrogate metrics of model quality, and propose a novel robust rank aggregation framework to combine multiple rankings from metrics.
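As a rough illustration of combining several rankings, the sketch below uses a plain Borda count, a standard rank aggregation method. It is not the robust aggregation approach proposed in the paper, and the model names and rankings are made up for the example.

```python
def borda_aggregate(rankings):
    """Aggregate rankings (each ordered best to worst) with a plain Borda count.

    Every ranking must contain the same set of models. A model in position p
    of a ranking over n models earns n - 1 - p points; models are returned in
    order of decreasing total points, with ties broken alphabetically.
    """
    models = sorted(rankings[0])
    n = len(models)
    points = {model: 0 for model in models}
    for ranking in rankings:
        for position, model in enumerate(ranking):
            points[model] += n - 1 - position
    return sorted(models, key=lambda model: (-points[model], model))

# Hypothetical rankings of four models induced by three surrogate metrics.
rankings = [
    ["model_a", "model_b", "model_c", "model_d"],  # e.g. prediction accuracy
    ["model_b", "model_a", "model_c", "model_d"],  # e.g. centrality
    ["model_a", "model_c", "model_b", "model_d"],  # e.g. synthetic anomalies
]

aggregated = borda_aggregate(rankings)
print(aggregated)  # ['model_a', 'model_b', 'model_c', 'model_d']
```

The first entry of the aggregated ranking is the selected model. A plain Borda count weighs every metric equally, so a single unreliable metric can distort the result, which is the motivation for a robust variant.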

If you use this code, please consider citing our work:

Unsupervised Model Selection for Time-series Anomaly Detection
Mononito Goswami, Cristian Ignacio Challu, Laurent Callot, Lenon Minorics, Andrey Kan
International Conference on Learning Representations (ICLR), 2023

Datasets

We carry out experiments on two popular and widely used real-world collections with diverse time-series and anomalies: (1) UCR Anomaly Archive (UCR) (Wu & Keogh, 2021), and (2) Server Machine Dataset (SMD) (Su et al., 2019).

These datasets can be downloaded using the download_data.py script in the scripts directory and loaded using the tsadams.datasets.load.load_data(...) function.

To load the UCR dataset:

    from tsadams.datasets.load import load_data

    # Load the data
    ENTITY = 'anomaly_archive' # 'anomaly_archive' OR 'smd' 
    
    DATASET = '028_UCR_Anomaly_DISTORTEDInternalBleeding17' # Name of timeseries in UCR or machine in SMD
    
    train_data = load_data(dataset=DATASET, 
                           group='train', 
                           entities=[ENTITY], 
                           downsampling=None, 
                           root_dir='/path_to_dataset_dir', 
                           normalize=True, 
                           verbose=True)
    
    test_data = load_data(dataset=DATASET, 
                          group='test', 
                          entities=[ENTITY], 
                          downsampling=None, 
                          root_dir='/path_to_dataset_dir', 
                          normalize=True, 
                          verbose=True)

Installation

We recommend installing Anaconda to run our code. To install Anaconda, see the official Anaconda installation instructions.

To set up the environment using conda (recommended, but optional), run the following commands:

    # To create environment from environment_explicit.yml file
    foo@bar:~$ conda env create -f environment_explicit.yml
    
    # To activate the environment
    foo@bar:~$ conda activate modelselect 
    
    # To verify if the new environment was installed correctly
    foo@bar:~$ conda env list

For an editable installation of our code from source, run the following commands:

    foo@bar:~$ git clone https://github.com/mononitogoswami/tsad-model-selection.git
    foo@bar:~$ cd tsad-model-selection/src/
    foo@bar:~$ pip install -e .

Reproduce Results

To reproduce the results presented in the paper, please follow these steps in the specified order. You can find all the necessary scripts in the src > scripts directory of this repository:

1. Run download_data.py to download the Server Machine Dataset (SMD) and the UCR Anomaly Archive.
2. Train multiple anomaly detection models for each dataset using train_all_models.py. You can track training progress using check_number_of_trained_models.py. After this stage, each dataset in SMD and the UCR Anomaly Archive should have a set of trained anomaly detection models. Note that some models may fail to finish training because of errors.
3. Next, get predictions (i.e., use each model to reconstruct the time-series in each dataset) for all models and datasets using evaluate_all_models.py. You can track the progress of this step using check_number_of_evaluated_models.py.
4. In the paper, we pool the predictions of a particular model on multiple related datasets, which gives a more robust measure of performance. To pool predictions, run compute_pooled_results.py. With this, we are all set to perform model selection!
5. Perform model selection for each pooled dataset and evaluate it using the results.ipynb notebook.
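The pooling step above can be pictured with the following sketch. It does not reproduce compute_pooled_results.py: the data layout (one pair of anomaly scores and labels per dataset), the helper names, and the precision-at-k metric are assumptions made for illustration, and it presumes that one model's scores are on a comparable scale across the related datasets.

```python
import numpy as np

def pool_predictions(per_dataset):
    """Concatenate (scores, labels) pairs produced by one model on related datasets."""
    scores = np.concatenate([scores for scores, _ in per_dataset])
    labels = np.concatenate([labels for _, labels in per_dataset])
    return scores, labels

def precision_at_k(scores, labels):
    """Fraction of true anomalies among the k highest-scoring points, k = number of anomalies."""
    k = int(labels.sum())
    top_k = np.argsort(scores)[::-1][:k]
    return float(labels[top_k].mean())

# Hypothetical anomaly scores and labels of one model on two related datasets.
dataset_1 = (np.array([0.1, 0.9, 0.2]), np.array([0, 1, 0]))
dataset_2 = (np.array([0.3, 0.2, 0.8, 0.1]), np.array([0, 0, 1, 0]))

pooled_scores, pooled_labels = pool_predictions([dataset_1, dataset_2])
pooled_precision = precision_at_k(pooled_scores, pooled_labels)
print(pooled_scores.shape, pooled_precision)  # (7,) 1.0
```

Each individual dataset here contains a single anomaly, so a per-dataset metric would hinge on one point. Pooling evaluates the model on all anomalies at once, which is why it is less sensitive to any single dataset.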

Citation

If you use our code, please cite our paper:

    @article{
        goswami2023unsupervised,
        title={Unsupervised Model Selection for Time-series Anomaly Detection},
        author={Goswami, Mononito and Challu, Cristian and Callot, Laurent and Minorics, Lenon and Kan, Andrey},
        journal={International Conference on Learning Representations},
        year={2023},
    }
