From a1f7ab1f801d2d724b27e07427d797257acb5849 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Miko=C5=82aj=20=C5=9Awi=C4=85tek?=
Date: Mon, 2 Jan 2023 17:43:57 +0100
Subject: [PATCH] feat: adjust average utilization for metadata autoscaling

---
 CHANGELOG.md                      |  2 ++
 deploy/helm/sumologic/README.md   |  4 +--
 deploy/helm/sumologic/values.yaml |  4 +--
 docs/best-practices.md            | 45 ++++++++++++++++++++++++++++++-
 docs/monitoring-lag.md            | 33 +++++++++++------------
 5 files changed, 66 insertions(+), 22 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3ffa255a58..354579a16e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,9 +13,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - chore: upgrade Fluentd to v1.15.3-sumo-0 [#2745]
   - This also upgrades Ruby from `v2.7` to `v3.1` and some other dependencies. See [v1.15.3-sumo-0] for more.
+- feat: adjust average utilization for metadata autoscaling [#2744]

 [#2724]: https://github.com/SumoLogic/sumologic-kubernetes-collection/pull/2724
 [#2745]: https://github.com/SumoLogic/sumologic-kubernetes-collection/pull/2745
+[#2744]: https://github.com/SumoLogic/sumologic-kubernetes-collection/pull/2744
 [v1.15.3-sumo-0]: https://github.com/SumoLogic/sumologic-kubernetes-fluentd/releases/tag/v1.15.3-sumo-0

 [Unreleased]: https://github.com/SumoLogic/sumologic-kubernetes-collection/compare/v3.0.0-beta.0...main
diff --git a/deploy/helm/sumologic/README.md b/deploy/helm/sumologic/README.md
index e28826f659..85d2eaa5d1 100644
--- a/deploy/helm/sumologic/README.md
+++ b/deploy/helm/sumologic/README.md
@@ -510,7 +510,7 @@ The following table lists the configurable parameters of the Sumo Logic chart an
 | `metadata.metrics.autoscaling.enabled` | Option to turn autoscaling on for metrics metadata enrichment (otelcol) and specify params for HPA. Autoscaling needs metrics-server to access cpu metrics. | `false` |
 | `metadata.metrics.autoscaling.minReplicas` | Default min replicas for autoscaling. | `3` |
 | `metadata.metrics.autoscaling.maxReplicas` | Default max replicas for autoscaling | `10` |
-| `metadata.metrics.autoscaling.targetCPUUtilizationPercentage` | The desired target CPU utilization for autoscaling. | `100` |
+| `metadata.metrics.autoscaling.targetCPUUtilizationPercentage` | The desired target CPU utilization for autoscaling. | `80` |
 | `metadata.metrics.autoscaling.targetMemoryUtilizationPercentage` | The desired target memory utilization for autoscaling. | `Nil` |
 | `metadata.metrics.podDisruptionBudget` | Pod Disruption Budget for metrics metadata enrichment (otelcol) statefulset. | `{"minAvailable": 2}` |
 | `metadata.logs.enabled` | Flag to control deploying the otelcol logs statefulsets. | `true` |
@@ -537,7 +537,7 @@ The following table lists the configurable parameters of the Sumo Logic chart an
 | `metadata.logs.autoscaling.enabled` | Option to turn autoscaling on for logs metadata enrichment (otelcol) and specify params for HPA. Autoscaling needs metrics-server to access cpu metrics. | `false` |
 | `metadata.logs.autoscaling.minReplicas` | Default min replicas for autoscaling. | `3` |
 | `metadata.logs.autoscaling.maxReplicas` | Default max replicas for autoscaling | `10` |
-| `metadata.logs.autoscaling.targetCPUUtilizationPercentage` | The desired target CPU utilization for autoscaling. | `100` |
+| `metadata.logs.autoscaling.targetCPUUtilizationPercentage` | The desired target CPU utilization for autoscaling. | `80` |
 | `metadata.logs.autoscaling.targetMemoryUtilizationPercentage` | The desired target memory utilization for autoscaling. | `Nil` |
 | `metadata.logs.podDisruptionBudget` | Pod Disruption Budget for logs metadata enrichment (otelcol) statefulset. | `{"minAvailable": 2}` |
 | `otelevents.image.repository` | Image repository for otelcol docker container. | `public.ecr.aws/sumologic/sumologic-otel-collector` |
diff --git a/deploy/helm/sumologic/values.yaml b/deploy/helm/sumologic/values.yaml
index 564ffdf8f2..97e4248d39 100644
--- a/deploy/helm/sumologic/values.yaml
+++ b/deploy/helm/sumologic/values.yaml
@@ -3874,7 +3874,7 @@ metadata:
       enabled: false
       minReplicas: 3
       maxReplicas: 10
-      targetCPUUtilizationPercentage: 100
+      targetCPUUtilizationPercentage: 80
       # targetMemoryUtilizationPercentage: 50

     ## Option to specify PodDisrutionBudgets
@@ -3974,7 +3974,7 @@ metadata:
       enabled: false
      minReplicas: 3
       maxReplicas: 10
-      targetCPUUtilizationPercentage: 100
+      targetCPUUtilizationPercentage: 80
       # targetMemoryUtilizationPercentage: 50

     ## Option to specify PodDisrutionBudgets
diff --git a/docs/best-practices.md b/docs/best-practices.md
index a36cb051d8..7c8993416d 100644
--- a/docs/best-practices.md
+++ b/docs/best-practices.md
@@ -83,7 +83,47 @@ See the chart's [README](/deploy/helm/sumologic/README.md) for all the available

 ## OpenTelemetry Collector Autoscaling

-:construction: *TODO*, see [the FluentD section](fluent/best-practices.md#fluentd-autoscaling)
+OpenTelemetry Collector StatefulSets used for metadata enrichment can be autoscaled. This is disabled by default,
+for reasons outlined in [#2751].
+
+Whenever your OpenTelemetry Collector Pods' CPU consumption is close to the limit, you could experience a [delay in data ingestion
+or even data loss](/docs/monitoring-lag.md) under higher-than-normal load. In such cases, you should enable autoscaling.
+
+To enable autoscaling for OpenTelemetry Collector:
+
+- Enable the metrics-server dependency
+
+  Note: If metrics-server is already installed, this step is not required.
+
+  ```yaml
+  ## Configure metrics-server
+  ## ref: https://github.com/bitnami/charts/tree/master/bitnami/metrics-server/values.yaml
+  metrics-server:
+    enabled: true
+  ```
+
+- Enable autoscaling for logs metadata enrichment
+
+  ```yaml
+  metadata:
+    logs:
+      autoscaling:
+        enabled: true
+  ```
+
+- Enable autoscaling for metrics metadata enrichment
+
+  ```yaml
+  metadata:
+    metrics:
+      autoscaling:
+        enabled: true
+  ```
+
+It's also possible to adjust other autoscaling configuration options, like the maximum number of replicas or
+the average CPU utilization. Please refer to the [chart readme][chart_readme] and [default values.yaml][values.yaml] for details.
+
+[#2751]: https://github.com/SumoLogic/sumologic-kubernetes-collection/issues/2751

 ## OpenTelemetry Collector File-Based Buffer

@@ -597,3 +637,6 @@ fluent-bit:
 In order to parse and store log content as json following configuration has to be applied:

 :construction: *TODO*, see [the FluentD section](fluent/best-practices.md#parsing-log-content-as-json)
+
+[chart_readme]: ../deploy/helm/sumologic/README.md
+[values.yaml]: ../deploy/helm/sumologic/values.yaml
diff --git a/docs/monitoring-lag.md b/docs/monitoring-lag.md
index 572599f2bc..355d0e2360 100644
--- a/docs/monitoring-lag.md
+++ b/docs/monitoring-lag.md
@@ -3,8 +3,10 @@
 Once you have Sumo Logic's collection setup installed, you should be primed to
 have metrics, logs, and events flowing into Sumo Logic.

-However, as your cluster scales up and down, you might find the need to rescale
-your Fluentd deployment replica count.
+However, as your cluster scales up and down, you might find the need to scale
+your metadata enrichment StatefulSets appropriately. To that end, you may need to
+enable autoscaling for metadata enrichment and potentially tweak its settings.
+
 Here are some tips on how to judge if you're seeing lag in your Sumo Logic collection
 pipeline.

@@ -13,28 +15,25 @@ pipeline.
    This dashboard can be found from the `Cluster` level in `Explore`, and is a great way
    of holistically judging if your collection process is working as expected.

-1. Fluentd CPU usage
-
-   Whenever your Fluentd pods CPU consumption is near the limit you could experience
-   a delay in data ingestion or even a data loss in extreme situations. Many times it is
-   caused by unsufficient amount of Fluentd instances being available. We advise to use the
-   [Fluentd autoscaling](https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/release-v2.5/deploy/docs/Best_Practices.md#fluentd-autoscaling)
-   with help of Horizontal Pod Autoscaler to mitigate this issue.
+1. OpenTelemetry Collector CPU usage

-   Also if the HPA is enabled but the maximum number of instances (configured in the HPA)
-   has been reached, it may cause a delay.
-   To mitigate this please increase the maximum number of instances for the HPA.
+   Whenever your OpenTelemetry Collector Pods' CPU consumption is near the limit, you could experience
+   a delay in data ingestion or even data loss in extreme situations. Usually this is
+   caused by an insufficient number of OpenTelemetry Collector instances being available. In
+   that case, consider [enabling autoscaling](./best-practices.md#opentelemetry-collector-autoscaling).
+   If autoscaling is already enabled, increase `maxReplicas` until the average CPU usage normalizes.
-
-1. Fluentd Queue Length
+
+1. OpenTelemetry Collector Queue Length

-   On the health check dashboard you'll see a panel for Fluentd queue length.
+   The `otelcol_exporter_queue_size` metric can be used to monitor the length of the on-disk
+   queue that OpenTelemetry Collector uses for outgoing data.
    If you see this length going up over time, chances are that you either have backpressure
-   or you are overwhelming the Fluentd pods with requests.
-   If you see any `429` status codes in the Fluentd logs, that means you are likely
+   or you are overwhelming the OpenTelemetry Collector pods with requests.
+   If you see any `429` status codes in the OpenTelemetry Collector logs, that means you are likely
    getting throttled and need to contact Sumo Logic to increase your base plan or increase
    the throttling limit.
    If you aren't seeing `429` then you likely are in a situation where the incoming traffic
-   into Fluentd is higher than the current replica count can handle.
+   into OpenTelemetry Collector is higher than the current replica count can handle.
    This is a good indication that you should scale up.

 1. Check Prometheus Remote Write Metrics
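
As a quick illustration of the settings this patch touches, the pieces from `docs/best-practices.md` can be combined into a single user values override. This is a minimal sketch, assuming the key layout of the chart's `values.yaml` shown in the diff above; the replica counts are simply the chart defaults and should be tuned to your actual load.

```yaml
# Sketch of a user values override enabling metadata autoscaling with the
# new 80% average CPU utilization target. Key paths follow the chart's
# values.yaml as shown in the diff; numbers here are defaults/examples only.
metrics-server:
  enabled: true # needed so the HPA can read CPU metrics, unless metrics-server is already installed

metadata:
  logs:
    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 10
      targetCPUUtilizationPercentage: 80
  metrics:
    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 10
      targetCPUUtilizationPercentage: 80
```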
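The `otelcol_exporter_queue_size` metric referenced in `docs/monitoring-lag.md` can also back an alert, so that a growing queue is noticed before data is delayed or dropped. The following is a hypothetical Prometheus rule sketch: the metric name comes from the docs above, while the rule group name, threshold, and durations are placeholder assumptions to adapt to your environment.

```yaml
# Hypothetical Prometheus alerting rule built on otelcol_exporter_queue_size.
# The 5000/10m/15m values are illustrative placeholders, not chart defaults.
groups:
  - name: otelcol-metadata-lag
    rules:
      - alert: OtelcolExporterQueueGrowing
        expr: avg_over_time(otelcol_exporter_queue_size[10m]) > 5000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "OpenTelemetry Collector exporter queue keeps growing"
          description: "Sustained queue growth usually means backpressure or too few metadata enrichment replicas; consider enabling autoscaling or raising maxReplicas."
```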