diff --git a/.gitbook/assets/Pod_TCP_diagram.png b/.gitbook/assets/Pod_TCP_diagram.png new file mode 100644 index 000000000..ea55d05c4 Binary files /dev/null and b/.gitbook/assets/Pod_TCP_diagram.png differ diff --git a/SUMMARY.md b/SUMMARY.md index 4fa762e16..c2995b269 100644 --- a/SUMMARY.md +++ b/SUMMARY.md @@ -116,7 +116,8 @@ * [Exposing SUSE Observability outside of the cluster](setup/install-stackstate/kubernetes_openshift/ingress.md) * [Initial run guide](setup/install-stackstate/initial_run_guide.md) * [Troubleshooting](setup/install-stackstate/troubleshooting.md) - * [Logs](configure/logging/kubernetes-logs.md) + * [Advanced Troubleshooting](setup/install-stackstate/advanced-troubleshooting.md) + * [Support Package (Logs)](setup/install-stackstate/support-package-logs.md) * [Configure SUSE Observability](setup/configure-stackstate/README.md) * [Slack notifications](setup/configure-stackstate/slack-notifications.md) * [E-mail notifications](setup/configure-stackstate/email-notifications.md) diff --git a/setup/install-stackstate/advanced-troubleshooting.md b/setup/install-stackstate/advanced-troubleshooting.md new file mode 100644 index 000000000..b02ab5bf2 --- /dev/null +++ b/setup/install-stackstate/advanced-troubleshooting.md @@ -0,0 +1,109 @@ +--- +description: SUSE Observability Self-hosted +--- + +# Advanced Troubleshooting + +When you are a prime customer, reach out to SUSE Observability support at [https://scc.suse.com/](https://scc.suse.com/) to get help setting up SUSE Observability in your local cluster. Use [Support Package (Logs)](/setup/install-stackstate/support-package-logs.md) to collect information about your instance for the support team. + +This page provides detailed information on the subsystems of the SUSE Observability platform to troubleshoot deployment and operational issues. This page should only be consulted when the steps in the [troubleshooting](/setup/install-stackstate/troubleshooting.md) do not yield a solution. + +## General troubleshooting approach + +The general approach to troubleshooting operational issues of the SUSE Observability platform, is the following: + +- Getting an overview of how the pods are behaving through `kubectl get pods` +- Use the detailed subsystem information in this document, together with the symptoms of the problem, to determine which pods/subsystems might be the root cause +- Inspect the logs/metadata of the suspected pods through: + - `kubectl logs --all-containers=true` + - `kubectl describe pod ` + - A quick way to get all related logs/description related to SUSE Observability is through the [Support Package (Logs)](/setup/install-stackstate/support-package-logs.md). +- It might be the logs point to some dependency misbehaving, in this case investigate the dependency. + +## Overview of subsystems + +### Databases + +SUSE Observability is powered by various databases, whenever a database is misbehaving, this should be investigated first because all other services depend on it + +- `Zookeeper`: Zookeeper is used for service discovery, orchestration and failover. Zookeeper is deployed using 1 or more pods with the name: + - `suse-observability-zookeeper-` +- `Kafka`: Kafka is used for message passing between almost all services: Kafka is deployed by the following pods: + - `suse-observability-kafka-`: Main kafka deployment + - `-kafkaup-operator-kafkaup-*`: Helper operator performing kafka upgrades +- `StackGraph`: StackGraph stored (user-)settings and the topology. StackGraph is built out of multiple components and has 2 deployment modes. HA and nonHA. + - `Tephra`: Manages database transaction starts, commits and conflicts. Served by pod `-hbase-tephra-` + - `-hbase-tephra-`: Tephra transaction server pod. Keeps track of transactions and conflicts. + - `HBase-HA`: Stores the StackGraph data, spread over multiple pods with different responsibilities: + - `-hbase-hdfs-nn-0`: Name-node for HDFS, keeps track of file index + - `-hbase-hdfs-snn-0`: Secondary name-node, does cleanup work after the name-node + - `-hbase-hdfs-dn-`: HDFS Datanode, stores the actual data + - `-hbase-hbase-master-`: HBase Master, coordinates tables and regions + - `-hbase-hbase-rs-`: HBase Region Server, serves tables and regions, stores its data on HDFS + - `HBase-non-HA`: + - `-hbase-stackgraph-0`: All StackGraph components deployed as a single pod in `non-HA` setup. This also includes its own zookeeper instance. +- `VictoriaMetrics`: Stores metric data. Is deployed by the pods: + - `suse-observability-victoria-metrics--0`: Main VictoriaMetrics data store/query node + - `suse-observability-vmagent-0`: Ingestion agent for VictoriaMetrics. Data is pushed to vmagent before being forwarded and stored. +- `ClickHouse`: Stores trace data. Deployed by the following pod(s): + - `suse-observability-clickhouse-shard0-`: Main clickhouse store +- `ElasticSearch`: Stores events and logs. Deployed by the following pods: + - `suse-observability-elasticsearch-master-`: Main Elasticsearch store + - `-prometheus-elasticsearch-exporter-*`: Exports performance metrics of the Elasticsearch instances + +### Ingestion services + +SUSE Observability platform gets data pushed by the agent and OpenTelemetry (OTEL) agent. The ingestion services perform initial processing and bring the data to storage. + +- `Receiver`: The receiver implements the collection-side API for the SUSE Observability agent. It accepts and authorizes telemetry data (logs, events, metrics or topology) and forwards it to the corresponding datastore or Kafka. It can be deployed in single or split mode: + - `Receiver-Split`: + - `-suse-observability-receiver-logs-*`: Receives logs and puts them into Elasticsearch + - `-suse-observability-receiver-process-agent-*`: Receives process and network connectivity information and forwards it to Kafka topics + - `-suse-observability-receiver-base`-*: All other SUSE Observability Agent data comes through here. + - `Receiver-NonSplit`: + - `-suse-observability-receiver-*`: All SUSE Observability Agent data comes through here. +- `OpenTelemetry Collector`: Provides an endpoint OpenTelemetry agents can push OpenTelemetry data to and produces traces, metrics and topology based on the pushed data. + - `suse-observability-otel-collector-0`: Single pod implementing the OTEL collector + +### Processing and serving + +SUSE Observability platform preforms correlation and monitoring on the telemetry data it received. The results of the are served to the customer on demand through the API. The core platform can be ran in distributed and non-distributed mode. Distributed allows for higher throughput. + +- `Correlator`: Correlates tcp connection information to turn it into topology. Implemented by pod: + - `-suse-observability-correlate-*` +- `Events2Elasticsearch`: Processes events and stores them in Elasticsearch: Implemented by pod: + - `-suse-observability-e2es-*` +- `Anomaly Detection`: The SUSE Observability platform does anomaly detection (disabled by default) on metrics, producing health violations: + - `-anomaly-detection-spotlight-manager-*`: Distributed anomaly detection work + - `-anomaly-detection-spotlight-worker-*`: Performs anomaly detection on metric streams +- `Platform-Distributed`: The platform contains the main processing components and serving api. In distributed mode functional units are split out. The pods that belong to the platform: + - `-suse-observability-api-*`: Serves all data to the user and manages StackPack installation/deinstallation. + - `-suse-observability-checks-*`: Runs the monitors + - `-suse-observability-health-sync-*`: Processes health (violation) information from monitors and the SUSE Observability Agent and attaches it to topology. + - `-suse-observability-initializer-*`: Coordinates initialization of the datastores and migrations + - `-suse-observability-notification-*`: Forwards notifications based on health violations and user setting to downstream systems like Slack/Opsgenie. + - `-suse-observability-slicing-*`: Continuously optimizes the topology history for quick retrieval + - `-suse-observability-state-*`: Processing health violations and aggregates them into component health + - `-suse-observability-sync-*`: Processes topology data combined with user settings and turns it into the topology graph. +- `Platform-Mono`: + - `-suse-observability-server-*`: Contains all functionality of the `Platform-Distributed` setup but in a single pod. + +### Miscellaneous + +- `Routing`: Accept connections and route to the right backend service: + - `-suse-observability-router-`: Router based on Envoy +- `UI`: React-based UI + - `-suse-observability-ui`: Serves just the static UI code and assets, all dynamic behavior is done by the `api` +- `Backup/Restore`: Periodically run jobs to backup the various data stores. Has one continuously running pod: + - `suse-observability-minio-*`: Provides an abstract interface for interacting with backup storage. + +## Relations between subsystems + +To effectively find the root cause of a problem, it is important to understand what pods are dependent on others when deployed. The following diagram shows an overview of the pods with TCP connections that can exist between them. When looking for a root cause it makes sense to look to the pod that is 'lowest' in this dependency chain. + +The pod name in this diagram are abbreviated for brevity. + +![Pod TCP Dependencies](../../.gitbook/assets/Pod_TCP_diagram.png) + + + diff --git a/configure/logging/kubernetes-logs.md b/setup/install-stackstate/support-package-logs.md similarity index 95% rename from configure/logging/kubernetes-logs.md rename to setup/install-stackstate/support-package-logs.md index d2d3516c8..78b07558f 100644 --- a/configure/logging/kubernetes-logs.md +++ b/setup/install-stackstate/support-package-logs.md @@ -1,8 +1,8 @@ --- -description: SUSE Observability Self-hosted v5.1.x +description: SUSE Observability Self-hosted --- -# SUSE Observability logs +# SUSE Observability Support Package (logs) ## Overview diff --git a/setup/install-stackstate/troubleshooting.md b/setup/install-stackstate/troubleshooting.md index 4160ab8bd..82f615ada 100644 --- a/setup/install-stackstate/troubleshooting.md +++ b/setup/install-stackstate/troubleshooting.md @@ -40,10 +40,9 @@ Here is a quick guide for troubleshooting the startup of SUSE Observability: The output contains an `event` section at the end which usually contains the problem. It also has a `State` section for each container that has more details for termination of the container. -3. [Check the logs](/configure/logging/kubernetes-logs.md) for errors. -4. Check the Knowledge base on the [SUSE Observability Support site](https://support.stackstate.com/). +3. When you are a prime customer, reach out to SUSE Observability support at [https://scc.suse.com/](https://scc.suse.com/) to get help setting up SUSE Observability in your local cluster. Use [Support Package (Logs)](/setup/install-stackstate/support-package-logs.md) to collect information about your instance for the support team. +4. In case the above steps did not resolve the issue, there is an [Advanced Troubleshooting Guide](./advanced-troubleshooting.md) available. + -## Known issues -Check the [SUSE Observability support Knowledge base](https://support.stackstate.com/hc/en-us/sections/360004684540-Known-issues) to find troubleshooting steps for all known issues.