
[Algorithm] Implement real-time anomaly detection for metrics #7

Open
2 of 9 tasks
Fengrui-Liu opened this issue Jun 28, 2022 · 8 comments
Labels: Algorithm (The work is on the algorithm side), analysis: metrics, type: feature (A feature to be implemented)

@Fengrui-Liu
Collaborator

The goals of our project are:

  • A gRPC data source for metrics (actually, I'm not quite sure about this; I'd refer to the gRPC implementation for logs in [Engine] Log data ingestion from gRPC data source #6).
  • Detection algorithms implementation
    • Implementation
      • Eight different algorithms are implemented in the StreamAD repo so far, with more on the way.
      • Pre-processing functions: holiday filtering and period extraction, referring to Prophet.
      • Post-processing functions: an anomaly score thresholder (a minimal pipeline sketch follows this list).
    • Evaluation
      • A broad evaluation across different datasets.
    • Code review and test
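
As a rough illustration of how these pieces could fit together, here is a minimal sketch of a per-point pipeline: a pre-processing filter, a streaming detector, and a score thresholder. The names (`is_holiday`, `threshold_score`, `process_point`) and the detector's `fit_score()` interface are assumptions for illustration, not the final API.

```python
# Minimal sketch of the intended per-point pipeline: pre-process, detect, post-process.
# All names here (is_holiday, threshold_score, process_point) are hypothetical.
from datetime import datetime
from typing import Optional


def is_holiday(ts: datetime) -> bool:
    """Hypothetical pre-processing filter: skip points that fall on holidays."""
    return (ts.month, ts.day) == (1, 1)  # placeholder rule for illustration only


def threshold_score(score: Optional[float], threshold: float = 0.9) -> bool:
    """Hypothetical post-processing step: turn a raw anomaly score into a yes/no alarm."""
    return score is not None and score >= threshold


def process_point(detector, ts: datetime, value: float) -> bool:
    """Run one metric point through pre-processing, detection, and post-processing."""
    if is_holiday(ts):
        return False  # the holiday filter drops this point
    score = detector.fit_score(value)  # assumed streaming interface: update + score in one call
    return threshold_score(score)
```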
@Superskyyy Superskyyy added the Algorithm The work is on the algorithm side label Jun 28, 2022
@Superskyyy
Member

Thanks! For the algorithms, we can directly install the package or add a git submodule from your repository.

@Superskyyy Superskyyy added type: feature A feature to be implemented analysis: metrics labels Jun 28, 2022
@Superskyyy Superskyyy added this to the 0.1.0 milestone Jun 28, 2022
@Fengrui-Liu
Collaborator Author

For now, although the algorithms have not been evaluated yet, I personally prefer SPOT. It clearly shows dynamic upper and lower bounds (some commercial products, like Datadog, have this feature), which is more user-friendly than the others. Of course, Prophet is also an option.

@Superskyyy
Member

That is great; let's keep this preference in mind and test it. I will provide you with a gRPC implementation for exporting metrics if needed, but for very early testing purposes, a simple generator function will suffice to mock a stream.
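
For that early testing path, something like the following generator could mock a metric stream; the noisy sine-wave baseline and the spike injection are arbitrary choices, not a real exporter.

```python
# A throwaway generator to mock a metric stream for early testing.
# The noisy sine-wave baseline and the injected spikes are arbitrary.
import math
import random
from typing import Iterator


def mock_metric_stream(anomaly_rate: float = 0.01) -> Iterator[float]:
    """Yield an endless stream of synthetic metric values with occasional anomalies."""
    t = 0
    while True:
        value = 10.0 + 2.0 * math.sin(t / 20.0) + random.gauss(0.0, 0.3)
        if random.random() < anomaly_rate:
            value += random.choice([-1, 1]) * random.uniform(5.0, 10.0)  # injected spike
        yield value
        t += 1
```

Feeding, say, `itertools.islice(mock_metric_stream(), 1000)` into a detector under test is enough to exercise the whole path without any gRPC plumbing.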

@Fengrui-Liu
Collaborator Author

Recently, I have been working on the benchmark. I realize that a single detector may not generalize well enough, so I'm trying to introduce AutoML-related techniques to automatically select the best among different detectors.
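
A very rough sketch of that selection idea, under the assumption that every candidate exposes the same streaming `fit_score()` interface; the accuracy-on-a-labeled-warm-up-window criterion is only a placeholder for whatever the benchmark ends up measuring.

```python
# Rough sketch of "run all candidates on an initial window, keep the best one".
# The labeled warm-up window and the accuracy criterion are placeholders; the
# real selection metric would come from the benchmark.
from typing import Callable, Dict, Sequence, Tuple


def select_detector(
    factories: Dict[str, Callable[[], object]],
    warmup: Sequence[Tuple[float, bool]],  # (value, is_anomaly) pairs
    threshold: float = 0.9,
) -> Tuple[str, object]:
    """Fit every candidate on the warm-up window and return the most accurate one."""
    best_name, best_detector, best_acc = "", None, -1.0
    for name, make in factories.items():
        detector = make()
        correct = 0
        for value, label in warmup:
            score = detector.fit_score(value)  # assumed common streaming interface
            predicted = score is not None and score >= threshold
            correct += int(predicted == label)
        accuracy = correct / max(len(warmup), 1)
        if accuracy > best_acc:
            best_name, best_detector, best_acc = name, detector, accuracy
    return best_name, best_detector
```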

@Superskyyy
Member

Superskyyy commented Jul 9, 2022

Recently, I have been working on the benchmark. I realize that a single detector may not generalize well enough, so I'm trying to introduce AutoML-related techniques to automatically select the best among different detectors.

Many commercial vendors use such techniques to provide reliable results, and I believe it's the right direction to go. Good luck, and keep me updated so we can collaborate.

@Superskyyy
Member

Superskyyy commented Jul 10, 2022

@Fengrui-Liu Are the metrics algorithms trained incrementally online or periodically offline? (I checked the SPOT paper, and it says both are doable, but I don't know the actual trade-off.)

I'm asking in terms of orchestration. When many models (one or more per metric stream) need to be trained at the same time, this introduces overhead on a single engine node that also has to do other computation (log analysis, ingestion, inference, etc.). Python cannot handle that easily without multiprocessing, and it will most likely lead to unmaintainable code.

So we had best scale them out, either with a periodic learning task scheduler (Airflow) or by assigning continuous learning tasks to 1-N analyzer nodes.

The final design will have the engine core, data ingestion, and analyzers (the actual learner workers) each as standalone modules, so it naturally has a basis for scaling, and each part can work or die independently.

@Fengrui-Liu
Collaborator Author

Are the metrics algorithms trained incrementally online or periodically offline?

So far, all the algorithms that we have implemented are trained incrementally.

When many models (one or more per metric stream) need to be trained at the same time, this introduces overhead on a single engine node that also has to do other computation (log analysis, ingestion, inference, etc.). Python cannot handle that easily without multiprocessing, and it will most likely lead to unmaintainable code.

Exactly, computing consumption also needs to be considered. I think this can be one reason why those commercial products do not deploy complex models. But in my opinion,

  1. one kind of AutoML-related approach runs all the models (more than one) only in the initial phase, and then selects a few of them to continuously process the follow-up data.
  2. we can let the users decide which metrics to detect, because the agents can export hundreds of metrics, and not every one of them is helpful for diagnosing problems.

So we had best scale them out, either with a periodic learning task scheduler (Airflow) or by assigning continuous learning tasks to 1-N analyzer nodes.

Both options work for our detectors at this point. For periodic detection, we can call fit() and score() separately; for continuous learning, we use fit_score().
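
For reference, a small sketch of the two modes with a StreamAD-style detector. The `SpotDetector` import and the `fit_score()` call follow StreamAD's README; the separate `fit()`/`score()` calls and the exact array shapes are assumptions based on the description above, and scores may be `None` while the detector is still warming up.

```python
# Sketch of the two training modes with a StreamAD-style detector.
# Exact class/method names and input shapes are assumptions based on StreamAD's docs.
import numpy as np
from streamad.model import SpotDetector

# Continuous (incremental) learning: update the model and score each point in one call.
continuous = SpotDetector()
for value in [10.1, 10.3, 9.8, 55.0, 10.2]:           # toy stream
    score = continuous.fit_score(np.array([value]))   # may be None during warm-up
    print("continuous", value, score)

# Periodic learning: refit on a recent batch on a schedule, then only score new points.
periodic = SpotDetector()
for value in [10.1, 10.3, 9.8, 10.0, 10.2]:           # scheduled training pass
    periodic.fit(np.array([value]))
for value in [10.1, 55.0]:                            # scoring-only pass until the next refit
    print("periodic", value, periodic.score(np.array([value])))
```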

The final design will have the engine core, data ingestion, and analyzers (the actual learner workers) each as standalone modules, so it naturally has a basis for scaling, and each part can work or die independently.

This can be achieved by instantiating objects. @Superskyyy

@Superskyyy
Member

Superskyyy commented Jul 11, 2022

Good insights. I've decided to move away from Airflow (it was never intended for streaming ETL purposes). We will rely only on a simple message queue to implement the orchestration. In the end, this is just a secondary system to a secondary system (the monitoring platform), and it should be as simple to learn as possible.

we can let the users decide which metrics to detect, because the agents can export hundreds of metrics, and not every one of them is helpful for diagnosing problems.

Yes, this is the intended behaviour; the SkyWalking metrics exporter natively supports partial subscription.

This can be achieved by instantiating objects. @Superskyyy

Standalone modules are a common pattern in today's containerized deployments. In our case, each node communicates only via a Redis task queue; they don't even need to know that the others exist. In a local-machine installation, everything will still be bundled together without any remote nodes (which I'm implementing right now; it's ideal for testing and the first release).
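
As a concrete shape for that, a minimal sketch of the Redis task queue between the engine and the analyzer nodes; the queue name, payload layout, and Redis address are illustrative assumptions.

```python
# Minimal sketch of the Redis task queue between the engine and analyzer nodes.
# Queue name, payload layout, and Redis address are illustrative assumptions.
import json

import redis

QUEUE = "aiops:analyzer:tasks"                     # hypothetical queue key
client = redis.Redis(host="localhost", port=6379)


def submit_task(metric_name: str, window: list) -> None:
    """Engine side: enqueue a learning/scoring task for any analyzer to pick up."""
    client.rpush(QUEUE, json.dumps({"metric": metric_name, "window": window}))


def analyzer_loop() -> None:
    """Analyzer side: block until a task arrives, then process it."""
    while True:
        _, raw = client.blpop(QUEUE)               # blocks until a task is available
        task = json.loads(raw)
        print(f"running detector for {task['metric']} on {len(task['window'])} points")
```

Because any number of analyzer processes can run `analyzer_loop()` against the same queue, nodes can join or die independently, which matches the standalone-module layout described above.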
