Workload Log Anomaly Detection
In Kubernetes clusters, log messages carry valuable information about the current state of the cluster. However, when something goes wrong in a working environment, there are often too many log messages to parse through, and a simple keyword search for error-related terms is not sufficient to pinpoint the central problem. This is where Opni log anomaly detection for workload logs comes in. Users select workloads of interest and train a deep learning model, which is then used to run inference on all logs from the selected workloads. Training a model requires an NVIDIA GPU, but once a model has been trained, inferencing can be done on the CPU alone.
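To see why keyword search alone falls short, consider a toy comparison (purely illustrative; this is not Opni's scoring logic): a keyword filter misses a problematic message that contains no error-related term, while even a naive rarity score surfaces it.

```python
from collections import Counter

def keyword_search(logs, keywords=("error", "fail", "exception")):
    """Naive keyword filter: keep only logs containing an error-related term."""
    return [log for log in logs if any(k in log.lower() for k in keywords)]

def rarity_scores(logs):
    """Score each distinct message by how rare it is in the stream (1 = unique)."""
    counts = Counter(logs)
    total = len(logs)
    return {log: 1 - counts[log] / total for log in counts}

logs = [
    "GET /healthz 200",
    "GET /healthz 200",
    "GET /healthz 200",
    "connection to etcd lost, retrying",  # the real problem: no "error" keyword
]

print(keyword_search(logs))         # [] -- the keyword filter finds nothing
scores = rarity_scores(logs)
print(max(scores, key=scores.get))  # "connection to etcd lost, retrying"
```

A learned model generalizes far beyond frequency counting, but the failure mode of keyword filtering is the same.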
- Opni Ingest Pipeline
- Opni Preprocessing Service
- Opni Workload DRAIN Service
- Opni AIOps Gateway
- Opni Training Controller Service
- Opni GPU Controller Service
- Opni CPU Inferencing Service
- Opni Opensearch Updating Service
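The first stages of the list above (preprocessing, then DRAIN template mining) can be sketched in a greatly simplified form. This is an assumption-laden illustration, not Opni's actual implementation: variable tokens are masked so structurally identical logs normalize to the same string, and a toy template store then groups them, standing in for the real Drain algorithm's parse tree.

```python
import re

def preprocess(log: str) -> str:
    """Mask variable tokens (IPs, hex ids, numbers) so logs with the same
    structure normalize to the same string -- the preprocessing step."""
    log = re.sub(r"\b\d{1,3}(\.\d{1,3}){3}\b", "<IP>", log)
    log = re.sub(r"\b0x[0-9a-f]+\b", "<HEX>", log)
    log = re.sub(r"\b\d+\b", "<NUM>", log)
    return log

class TemplateMiner:
    """Toy Drain-style template store: identical masked logs share one
    template id, mimicking what the workload DRAIN service caches."""
    def __init__(self):
        self.templates: dict[str, int] = {}

    def add(self, log: str) -> int:
        key = preprocess(log)
        if key not in self.templates:
            self.templates[key] = len(self.templates)
        return self.templates[key]

miner = TemplateMiner()
ids = [miner.add(line) for line in [
    "pod restarted 3 times on 10.0.0.5",
    "pod restarted 7 times on 10.0.0.9",
    "disk pressure detected on node 2",
]]
print(ids)  # the first two logs collapse to the same template id
```

Downstream services then work with compact template ids rather than raw log text.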
A description of how the system will be scaled and the expected performance characteristics, including any performance metrics that will be used to measure success.
A description of the security considerations for the system.
A description of how the system will be designed for high availability, including any redundancy or failover mechanisms that will be implemented. Currently, the ingest plugin, preprocessing service, CPU inferencing service, and Opensearch updating service are all designed so that their pods can be scaled up. The workload DRAIN service, however, does not support a proper HA setup: it maintains a local workload DRAIN cache, so scaling up its pods would result in several divergent DRAIN caches.
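The cache-divergence problem can be shown with a toy per-pod template cache (hypothetical; not Opni's DRAIN implementation): because template ids are assigned in arrival order, two replicas that receive load-balanced traffic in different orders end up mapping the same template to different ids.

```python
class LocalTemplateCache:
    """Toy stand-in for a per-pod DRAIN cache: template ids are assigned
    in arrival order, so they are only meaningful within one replica."""
    def __init__(self):
        self.ids = {}

    def template_id(self, masked_log: str) -> int:
        # Return the existing id, or assign the next sequential id.
        return self.ids.setdefault(masked_log, len(self.ids))

# Two scaled-up replicas receive the same traffic, load-balanced differently.
replica_a, replica_b = LocalTemplateCache(), LocalTemplateCache()
replica_a.template_id("pod restarted <NUM> times")    # id 0 on replica A
replica_b.template_id("disk pressure on node <NUM>")  # id 0 on replica B

# The same template now maps to different ids on each replica:
print(replica_a.template_id("disk pressure on node <NUM>"))  # 1
print(replica_b.template_id("disk pressure on node <NUM>"))  # 0
```

Until the cache is moved to shared state (or replicas are sharded deterministically), running more than one DRAIN pod would produce inconsistent template ids downstream.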
A description of the test plan, including any manual testing and steps to reproduce.
Architecture
- Backends
- Core Components