feat: adjust average utilization for metadata autoscaling
Mikołaj Świątek committed Jan 4, 2023
1 parent 25e5279 commit a1f7ab1
Showing 5 changed files with 66 additions and 22 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -13,9 +13,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- chore: upgrade Fluentd to v1.15.3-sumo-0 [#2745]
- This also upgrades Ruby from `v2.7` to `v3.1` and some other dependencies.
See [v1.15.3-sumo-0] for more.
- feat: adjust average utilization for metadata autoscaling [#2744]

[#2724]: https://github.com/SumoLogic/sumologic-kubernetes-collection/pull/2724
[#2745]: https://github.com/SumoLogic/sumologic-kubernetes-collection/pull/2745
[#2744]: https://github.com/SumoLogic/sumologic-kubernetes-collection/pull/2744
[v1.15.3-sumo-0]: https://github.com/SumoLogic/sumologic-kubernetes-fluentd/releases/tag/v1.15.3-sumo-0
[Unreleased]: https://github.com/SumoLogic/sumologic-kubernetes-collection/compare/v3.0.0-beta.0...main

4 changes: 2 additions & 2 deletions deploy/helm/sumologic/README.md
@@ -510,7 +510,7 @@ The following table lists the configurable parameters of the Sumo Logic chart an
| `metadata.metrics.autoscaling.enabled` | Option to turn autoscaling on for metrics metadata enrichment (otelcol) and specify params for HPA. Autoscaling needs metrics-server to access cpu metrics. | `false` |
| `metadata.metrics.autoscaling.minReplicas` | Default min replicas for autoscaling. | `3` |
| `metadata.metrics.autoscaling.maxReplicas` | Default max replicas for autoscaling | `10` |
| `metadata.metrics.autoscaling.targetCPUUtilizationPercentage` | The desired target CPU utilization for autoscaling. | `100` |
| `metadata.metrics.autoscaling.targetCPUUtilizationPercentage` | The desired target CPU utilization for autoscaling. | `80` |
| `metadata.metrics.autoscaling.targetMemoryUtilizationPercentage` | The desired target memory utilization for autoscaling. | `Nil` |
| `metadata.metrics.podDisruptionBudget` | Pod Disruption Budget for metrics metadata enrichment (otelcol) statefulset. | `{"minAvailable": 2}` |
| `metadata.logs.enabled` | Flag to control deploying the otelcol logs statefulsets. | `true` |
@@ -537,7 +537,7 @@ The following table lists the configurable parameters of the Sumo Logic chart an
| `metadata.logs.autoscaling.enabled` | Option to turn autoscaling on for logs metadata enrichment (otelcol) and specify params for HPA. Autoscaling needs metrics-server to access cpu metrics. | `false` |
| `metadata.logs.autoscaling.minReplicas` | Default min replicas for autoscaling. | `3` |
| `metadata.logs.autoscaling.maxReplicas` | Default max replicas for autoscaling | `10` |
| `metadata.logs.autoscaling.targetCPUUtilizationPercentage` | The desired target CPU utilization for autoscaling. | `100` |
| `metadata.logs.autoscaling.targetCPUUtilizationPercentage` | The desired target CPU utilization for autoscaling. | `80` |
| `metadata.logs.autoscaling.targetMemoryUtilizationPercentage` | The desired target memory utilization for autoscaling. | `Nil` |
| `metadata.logs.podDisruptionBudget` | Pod Disruption Budget for logs metadata enrichment (otelcol) statefulset. | `{"minAvailable": 2}` |
| `otelevents.image.repository` | Image repository for otelcol docker container. | `public.ecr.aws/sumologic/sumologic-otel-collector` |
4 changes: 2 additions & 2 deletions deploy/helm/sumologic/values.yaml
@@ -3874,7 +3874,7 @@ metadata:
enabled: false
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 100
targetCPUUtilizationPercentage: 80
# targetMemoryUtilizationPercentage: 50

## Option to specify PodDisruptionBudgets
@@ -3974,7 +3974,7 @@ metadata:
enabled: false
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 100
targetCPUUtilizationPercentage: 80
# targetMemoryUtilizationPercentage: 50

## Option to specify PodDisruptionBudgets
45 changes: 44 additions & 1 deletion docs/best-practices.md
@@ -83,7 +83,47 @@ See the chart's [README](/deploy/helm/sumologic/README.md) for all the available

## OpenTelemetry Collector Autoscaling

:construction: *TODO*, see [the FluentD section](fluent/best-practices.md#fluentd-autoscaling)
OpenTelemetry Collector StatefulSets used for metadata enrichment can be autoscaled. This is disabled by default,
for reasons outlined in [#2751].

Whenever your OpenTelemetry Collector pods' CPU consumption is close to the limit, you may experience a [delay in data ingestion
or even data loss](/docs/monitoring-lag.md) under higher-than-normal load. In such cases, you should enable autoscaling.

To enable autoscaling for OpenTelemetry Collector:

- Enable the metrics-server dependency

Note: If metrics-server is already installed, this step is not required.

```yaml
## Configure metrics-server
## ref: https://github.com/bitnami/charts/tree/master/bitnami/metrics-server/values.yaml
metrics-server:
enabled: true
```

- Enable autoscaling for log metadata enrichment

```yaml
metadata:
logs:
autoscaling:
enabled: true
```

- Enable autoscaling for metrics metadata enrichment

```yaml
metadata:
metrics:
autoscaling:
enabled: true
```

It's also possible to adjust other autoscaling configuration options, such as the maximum number of replicas or
the target average CPU utilization. Refer to the [chart readme][chart_readme] and the [default values.yaml][values.yaml] for details.
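
For example, a minimal sketch of such overrides in a user `values.yaml` (the keys match the chart README above; the replica counts and thresholds below are illustrative, not recommendations):

```yaml
metadata:
  logs:
    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 20                      # raise the ceiling if CPU stays high with 10 replicas
      targetCPUUtilizationPercentage: 80   # chart default; lower it to scale out earlier
      # targetMemoryUtilizationPercentage: 50
  metrics:
    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 20
      targetCPUUtilizationPercentage: 80
```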

[#2751]: https://github.com/SumoLogic/sumologic-kubernetes-collection/issues/2751

## OpenTelemetry Collector File-Based Buffer

@@ -597,3 +637,6 @@ fluent-bit:
In order to parse and store log content as JSON, the following configuration has to be applied:

:construction: *TODO*, see [the FluentD section](fluent/best-practices.md#parsing-log-content-as-json)

[chart_readme]: ../deploy/helm/sumologic/README.md
[values.yaml]: ../deploy/helm/sumologic/values.yaml
33 changes: 16 additions & 17 deletions docs/monitoring-lag.md
@@ -3,8 +3,10 @@
Once you have Sumo Logic's collection setup installed, you should be primed
to have metrics, logs, and events flowing into Sumo Logic.

However, as your cluster scales up and down, you might find the need to rescale
your Fluentd deployment replica count.
However, as your cluster scales up and down, you may need to scale
your metadata enrichment StatefulSets accordingly. To that end, consider
enabling autoscaling for metadata enrichment and tuning its settings.

Here are some tips on how to judge if you're seeing lag in your Sumo Logic collection
pipeline.

@@ -13,28 +15,25 @@ pipeline.
This dashboard can be found at the `Cluster` level in `Explore`, and is a great way
of holistically judging whether your collection process is working as expected.

1. Fluentd CPU usage

Whenever your Fluentd pods CPU consumption is near the limit you could experience
a delay in data ingestion or even a data loss in extreme situations. Many times it is
caused by unsufficient amount of Fluentd instances being available. We advise to use the
[Fluentd autoscaling](https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/release-v2.5/deploy/docs/Best_Practices.md#fluentd-autoscaling)
with help of Horizontal Pod Autoscaler to mitigate this issue.
1. OpenTelemetry Collector CPU usage

Also if the HPA is enabled but the maximum number of instances (configured in the HPA)
has been reached, it may cause a delay.
To mitigate this please increase the maximum number of instances for the HPA.
Whenever your OpenTelemetry Collector pods' CPU consumption is near the limit, you may experience
a delay in data ingestion or even data loss in extreme situations. This is usually
caused by an insufficient number of OpenTelemetry Collector instances being available. In
that case, consider [enabling autoscaling](./best-practices.md#opentelemetry-collector-autoscaling).
If autoscaling is already enabled, increase `maxReplicas` until the average CPU usage normalizes.
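
A hypothetical `values.yaml` override raising that ceiling for both metadata StatefulSets could look like this (the value `20` is illustrative, not a recommendation):

```yaml
metadata:
  logs:
    autoscaling:
      maxReplicas: 20   # chart default is 10
  metrics:
    autoscaling:
      maxReplicas: 20   # chart default is 10
```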

1. Fluentd Queue Length
1. OpenTelemetry Collector Queue Length

On the health check dashboard you'll see a panel for Fluentd queue length.
The `otelcol_exporter_queue_size` metric can be used to monitor the length of the on-disk
queue OpenTelemetry Collector uses for outgoing data.
If you see this length going up over time, chances are that you either have backpressure
or you are overwhelming the Fluentd pods with requests.
If you see any `429` status codes in the Fluentd logs, that means you are likely
or you are overwhelming the OpenTelemetry Collector pods with requests.
If you see any `429` status codes in the OpenTelemetry Collector logs, that means you are likely
getting throttled and need to contact Sumo Logic to increase your base plan
or increase the throttling limit.
If you aren't seeing `429` then you likely are in a situation where the incoming traffic
into Fluentd is higher than the current replica count can handle.
into OpenTelemetry Collector is higher than the current replica count can handle.
This is a good indication that you should scale up.

1. Check Prometheus Remote Write Metrics
