
[Algorithm] Implement real-time anomaly detection for metrics #7

Open
2 of 9 tasks
Fengrui-Liu opened this issue Jun 28, 2022 · 8 comments
Labels: Algorithm (The work is on the algorithm side), analysis: metrics, type: feature (A feature to be implemented)

@Fengrui-Liu
Collaborator

The goals of our project are:

  • A gRPC data source for metrics (actually, I'm not quite sure about this; I'd refer to the gRPC implementation for logs in [Engine] Log data ingestion from gRPC data source #6).
  • Detection algorithms implementation
    • Implementation
      • Eight different algorithms are implemented in the StreamAD repo so far, with more on the way.
      • Pre-processing functions: holiday filtering and period extraction, referring to Prophet.
      • Post-processing functions: an anomaly score thresholder (a minimal pipeline sketch follows this list).
    • Evaluation
      • A broad evaluation across different datasets.
    • Code review and test
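
As a rough illustration of how these pieces could fit together, here is a minimal sketch of a per-point pipeline: a pre-processing filter, a streaming detector, and a score thresholder. The names (`is_holiday`, `threshold_score`, `process_point`) and the detector's `fit_score()` interface are assumptions for illustration, not the final API.

```python
# Minimal sketch of the intended per-point pipeline: pre-process, detect, post-process.
# All names here (is_holiday, threshold_score, process_point) are hypothetical.
from datetime import datetime
from typing import Optional


def is_holiday(ts: datetime) -> bool:
    """Hypothetical pre-processing filter: skip points that fall on holidays."""
    return (ts.month, ts.day) == (1, 1)  # placeholder rule for illustration only


def threshold_score(score: Optional[float], threshold: float = 0.9) -> bool:
    """Hypothetical post-processing step: turn a raw anomaly score into a yes/no alarm."""
    return score is not None and score >= threshold


def process_point(detector, ts: datetime, value: float) -> bool:
    """Run one metric point through pre-processing, detection, and post-processing."""
    if is_holiday(ts):
        return False  # the holiday filter drops this point
    score = detector.fit_score(value)  # assumed streaming interface: update + score in one call
    return threshold_score(score)
```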
@Superskyyy Superskyyy added the Algorithm The work is on the algorithm side label Jun 28, 2022
@Superskyyy
Member

Thanks! For the algorithms, we can directly install the package or add a git submodule from your repository.

@Superskyyy Superskyyy added type: feature A feature to be implemented analysis: metrics labels Jun 28, 2022
@Superskyyy Superskyyy added this to the 0.1.0 milestone Jun 28, 2022
@Fengrui-Liu
Collaborator Author

For now, although the algorithms have not been evaluated yet, I personally prefer SPOT. It clearly shows dynamic upper and lower bounds (some commercial products, like Datadog, have this feature), which is more user-friendly than the others. Of course, Prophet is also an option.

@Superskyyy
Member

That is great; let's keep this preference in mind and test it. I will provide you with a gRPC implementation for exporting metrics if needed, but for very early testing purposes, a simple generator function will suffice to mock a stream.
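
For that early testing path, something like the following generator could mock a metric stream; the noisy sine-wave baseline and the spike injection are arbitrary choices, not a real exporter.

```python
# A throwaway generator to mock a metric stream for early testing.
# The noisy sine-wave baseline and the injected spikes are arbitrary.
import math
import random
from typing import Iterator


def mock_metric_stream(anomaly_rate: float = 0.01) -> Iterator[float]:
    """Yield an endless stream of synthetic metric values with occasional anomalies."""
    t = 0
    while True:
        value = 10.0 + 2.0 * math.sin(t / 20.0) + random.gauss(0.0, 0.3)
        if random.random() < anomaly_rate:
            value += random.choice([-1, 1]) * random.uniform(5.0, 10.0)  # injected spike
        yield value
        t += 1
```

Feeding, say, `itertools.islice(mock_metric_stream(), 1000)` into a detector under test is enough to exercise the whole path without any gRPC plumbing.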

@Fengrui-Liu
Collaborator Author

Recently, I have been working on the benchmark. I realize that a single detector may not generalize well enough, so I'm trying to introduce AutoML-related techniques to automatically select the best among different detectors.
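
A very rough sketch of that selection idea, under the assumption that every candidate exposes the same streaming `fit_score()` interface; the accuracy-on-a-labeled-warm-up-window criterion is only a placeholder for whatever the benchmark ends up measuring.

```python
# Rough sketch of "run all candidates on an initial window, keep the best one".
# The labeled warm-up window and the accuracy criterion are placeholders; the
# real selection metric would come from the benchmark.
from typing import Callable, Dict, Sequence, Tuple


def select_detector(
    factories: Dict[str, Callable[[], object]],
    warmup: Sequence[Tuple[float, bool]],  # (value, is_anomaly) pairs
    threshold: float = 0.9,
) -> Tuple[str, object]:
    """Fit every candidate on the warm-up window and return the most accurate one."""
    best_name, best_detector, best_acc = "", None, -1.0
    for name, make in factories.items():
        detector = make()
        correct = 0
        for value, label in warmup:
            score = detector.fit_score(value)  # assumed common streaming interface
            predicted = score is not None and score >= threshold
            correct += int(predicted == label)
        accuracy = correct / max(len(warmup), 1)
        if accuracy > best_acc:
            best_name, best_detector, best_acc = name, detector, accuracy
    return best_name, best_detector
```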

@Superskyyy
Member

Superskyyy commented Jul 9, 2022

Recently, I have been working on the benchmark. I realize that a single detector may not generalize well enough, so I'm trying to introduce AutoML-related techniques to automatically select the best among different detectors.

Many commercial vendors use such techniques to provide reliable results, and I believe it's the right direction to go. Good luck, and keep me updated so we can collaborate.

@Superskyyy
Member

Superskyyy commented Jul 10, 2022

@Fengrui-Liu Are the metrics algorithms trained incrementally online or periodically offline? (I checked the SPOT paper, and it says both are doable, but I don't know the actual trade-off.)

I'm asking in terms of orchestration. When many models (one or more per metric stream) need to be trained at the same time, this introduces overhead on a single engine node that also has to do other computation (log analysis, ingestion, inference, etc.). Python cannot handle that easily without multiprocessing, and it will most likely lead to unmaintainable code.

So we had best scale them out, either with a periodic learning task scheduler (Airflow) or by assigning continuous learning tasks to 1-N analyzer nodes.

The final design will have the engine core, data ingestion, and analyzers (the actual learner workers) each as standalone modules, so it naturally has a basis for scaling, and each part can work or die independently.

@Fengrui-Liu
Collaborator Author

Are the metrics algorithms trained incrementally online or periodically offline?

So far, all the algorithms that we have implemented are trained incrementally.

When many models (one or more per metric stream) need to be trained at the same time, this introduces overhead on a single engine node that also has to do other computation (log analysis, ingestion, inference, etc.). Python cannot handle that easily without multiprocessing, and it will most likely lead to unmaintainable code.

Exactly, computing consumption also needs to be considered. I think this can be one reason why those commercial products do not deploy complex models. But in my opinion,

  1. one kind of AutoML-related approach runs all the models (more than one) only in the initial phase, and then selects a few of them to continuously process the follow-up data.
  2. we can let the users decide which metrics to detect, because the agents can export hundreds of metrics, and not every one of them is helpful for diagnosing problems.

So we had best scale them out, either with a periodic learning task scheduler (Airflow) or by assigning continuous learning tasks to 1-N analyzer nodes.

Both options work for our detectors at this point. For periodic detection, we can call fit() and score() separately; for continuous learning, we use fit_score().
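
For reference, a small sketch of the two modes with a StreamAD-style detector. The `SpotDetector` import and the `fit_score()` call follow StreamAD's README; the separate `fit()`/`score()` calls and the exact array shapes are assumptions based on the description above, and scores may be `None` while the detector is still warming up.

```python
# Sketch of the two training modes with a StreamAD-style detector.
# Exact class/method names and input shapes are assumptions based on StreamAD's docs.
import numpy as np
from streamad.model import SpotDetector

# Continuous (incremental) learning: update the model and score each point in one call.
continuous = SpotDetector()
for value in [10.1, 10.3, 9.8, 55.0, 10.2]:           # toy stream
    score = continuous.fit_score(np.array([value]))   # may be None during warm-up
    print("continuous", value, score)

# Periodic learning: refit on a recent batch on a schedule, then only score new points.
periodic = SpotDetector()
for value in [10.1, 10.3, 9.8, 10.0, 10.2]:           # scheduled training pass
    periodic.fit(np.array([value]))
for value in [10.1, 55.0]:                            # scoring-only pass until the next refit
    print("periodic", value, periodic.score(np.array([value])))
```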

The final design will have the engine core, data ingestion, and analyzers (the actual learner workers) each as standalone modules, so it naturally has a basis for scaling, and each part can work or die independently.

This can be achieved by instantiating objects. @Superskyyy

@Superskyyy
Member

Superskyyy commented Jul 11, 2022

Good insights. I've decided to move away from Airflow (it was never intended for streaming ETL purposes). We will rely only on a simple message queue to implement the orchestration. In the end, this is just a secondary system to a secondary system (the monitoring platform), and it should be as simple to learn as possible.

we can let the users decide which metrics to detect, because the agents can export hundreds of metrics, and not every one of them is helpful for diagnosing problems.

Yes, this is the intended behaviour; the SkyWalking metrics exporter natively supports partial subscription.

This can be achieved by instantiating objects. @Superskyyy

Standalone modules are a common pattern in today's containerized deployments. In our case, each node communicates only via a Redis task queue; they don't even need to know that the others exist. In a local-machine installation, everything will still be bundled together without any remote nodes (which I'm implementing right now; it's ideal for testing and the first release).
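
As a concrete shape for that, a minimal sketch of the Redis task queue between the engine and the analyzer nodes; the queue name, payload layout, and Redis address are illustrative assumptions.

```python
# Minimal sketch of the Redis task queue between the engine and analyzer nodes.
# Queue name, payload layout, and Redis address are illustrative assumptions.
import json

import redis

QUEUE = "aiops:analyzer:tasks"                     # hypothetical queue key
client = redis.Redis(host="localhost", port=6379)


def submit_task(metric_name: str, window: list) -> None:
    """Engine side: enqueue a learning/scoring task for any analyzer to pick up."""
    client.rpush(QUEUE, json.dumps({"metric": metric_name, "window": window}))


def analyzer_loop() -> None:
    """Analyzer side: block until a task arrives, then process it."""
    while True:
        _, raw = client.blpop(QUEUE)               # blocks until a task is available
        task = json.loads(raw)
        print(f"running detector for {task['metric']} on {len(task['window'])} points")
```

Because any number of analyzer processes can run `analyzer_loop()` against the same queue, nodes can join or die independently, which matches the standalone-module layout described above.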
