Workload Log Anomaly Detection

Amartya Chakraborty edited this page Jan 25, 2023 · 7 revisions

Summary:

In a Kubernetes cluster, log messages carry valuable information about the cluster's current state. In practice, however, when something goes wrong there are often too many log messages to sift through, and a simple search for error-related keywords is not enough to pinpoint the root cause. This is where Opni log anomaly detection for workload logs comes in. Users select the workloads of interest and train a Deep Learning model, which is then used to run inference on all logs from those workloads. Training a model requires an NVIDIA GPU, but once a model has been trained, inference can run on the CPU alone.
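As a rough illustration of the inference step (a minimal sketch only, not Opni's actual Deep Learning model), the idea is to score each incoming log line with a trained model and flag lines whose score exceeds a threshold. Here a toy stand-in model scores a line by the fraction of its tokens that were unseen during training:

```python
# Toy stand-in for a trained anomaly model (hypothetical, for illustration only).
def train_vocab(training_logs):
    """Build a vocabulary of tokens seen in normal workload logs."""
    vocab = set()
    for line in training_logs:
        vocab.update(line.lower().split())
    return vocab

def anomaly_score(line, vocab):
    """Score a log line by the fraction of tokens not seen during training."""
    tokens = line.lower().split()
    if not tokens:
        return 0.0
    unseen = sum(1 for t in tokens if t not in vocab)
    return unseen / len(tokens)

def detect_anomalies(logs, vocab, threshold=0.5):
    """Return the log lines whose anomaly score exceeds the threshold."""
    return [line for line in logs if anomaly_score(line, vocab) > threshold]

training = ["pod started successfully", "connection established to service"]
vocab = train_vocab(training)
logs = ["pod started successfully", "fatal segfault in kernel module xyz"]
print(detect_anomalies(logs, vocab))  # → ['fatal segfault in kernel module xyz']
```

The real system replaces the scoring function with a trained Deep Learning model, but the surrounding flow, scoring every line from the selected workloads and surfacing the ones above a threshold, is the same shape.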

Architecture:

(Figure: Workload Log Anomaly Detection architecture diagram)

Components

Scale and performance:

A description of how the system will be scaled and the expected performance characteristics, including any performance metrics that will be used to measure success.

Security:

A description of the security considerations for the system

High availability:

A description of how the system will be designed for high availability, including any redundancy or failover mechanisms that will be implemented. Currently, the ingest plugin, preprocessing service, CPU inferencing service, and Opensearch updating service are all designed so that their pods can be scaled out. The workload DRAIN service, however, does not support a proper HA setup: each pod keeps its own local workload DRAIN cache, so scaling it out would result in multiple divergent DRAIN caches.
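To illustrate why a per-pod cache prevents a proper HA setup (a hypothetical sketch, not Opni's actual DRAIN service), consider two replicas that each assign cluster IDs from their own local cache. The same ID can end up referring to different log templates depending on which replica handled the line:

```python
# Hypothetical two-replica setup with per-pod template caches (illustration only).
class DrainLikePod:
    """A stand-in for a DRAIN-style service that keeps a LOCAL template cache."""

    def __init__(self):
        self.cache = {}  # template -> locally assigned cluster id

    def ingest(self, line):
        # Crude templating: mask numeric tokens, as DRAIN-style miners do.
        template = " ".join("<*>" if tok.isdigit() else tok
                            for tok in line.split())
        if template not in self.cache:
            self.cache[template] = len(self.cache) + 1
        return self.cache[template]

pods = [DrainLikePod(), DrainLikePod()]
logs = ["user 1 logged in", "disk 7 full", "user 2 logged in"]

# Round-robin load balancing across the two replicas.
for i, line in enumerate(logs):
    pods[i % 2].ingest(line)

print(pods[0].cache)  # → {'user <*> logged in': 1}
print(pods[1].cache)  # → {'disk <*> full': 1}
```

Both replicas hand out cluster ID 1, but for two different templates, so downstream consumers can no longer trust the IDs. Keeping the cache in shared state (or pinning all traffic to a single pod, as the service does today) avoids this divergence.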

Testing:

A description of the test plan, including any manual testing and steps to reproduce
