doc: refactor Performance.md to fit lint

sumo-drosiek committed Feb 15, 2021
1 parent a10dc92 commit f92815f
Showing 1 changed file with 30 additions and 13 deletions.
deploy/docs/Performance.md

# Performance

For larger or more volatile loads, we recommend [enabling Fluentd autoscaling](./Best_Practices.md#Fluentd-Autoscaling),
as this will allow Fluentd to automatically scale to support your data volume.
However, the following recommendations and corresponding examples will help you get an idea of the resources
required to run collection on your cluster.
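
A minimal sketch of what enabling autoscaling can look like in [`values.yaml`](./../helm/sumologic/values.yaml)
is shown below; the exact keys and defaults depend on your chart version, so treat this as an assumption
and follow [Best_Practices.md](./Best_Practices.md#Fluentd-Autoscaling) for the authoritative settings.

```yaml
fluentd:
  logs:
    # Assumed keys; verify them against the values.yaml shipped with your chart version.
    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 10
      targetCPUUtilizationPercentage: 50
  metrics:
    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 10
      targetCPUUtilizationPercentage: 50
```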

- [Recommendations](#recommendations)
- [Up to 500 application pods](#up-to-500-application-pods)
- [Up to 2000 application pods](#up-to-2000-application-pods)

## Recommendations

1. At least **8 Fluentd-logs** pods per **1 TB/day** of logs.
1. At least **4 Fluentd-metrics** pods per **120k DPM** of metrics.
1. The Prometheus pod will use on average **2GiB** of memory per **120k DPM**;
however, in our experience this has reached **5GiB**,
so we recommend allocating ample memory resources for the Prometheus pod
if you wish to collect a high volume of metrics for a larger cluster.
1. For clusters with 500 or more application pods,
we found that the following configuration changes in the [`values.yaml`](./../helm/sumologic/values.yaml)
file led to a more stable experience:
- Increase the FluentBit in_tail `Mem_Buf_Limit` from 5MB to 64MB (sketch after this list)
- Set the `remote_timeout` (`remoteTimeout` in the Helm values) to 1s (default 30s) for each item
in the Prometheus remote write section under `kube-prometheus-stack.prometheus.prometheusSpec.remoteWrite`:

```yaml
- url: http://$(FLUENTD_METRICS_SVC).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.node
  writeRelabelConfigs:
    - action: keep
      # ... (additional relabel settings omitted)
  remoteTimeout: 1s
```

1. For clusters with 2000 application pods, we found that the **Fluentd-events** pod had to be given a 1 GiB memory limit
to accommodate the increased events load. If you find that the **Fluentd-events** pod is being OOMKilled,
please increase the memory limits and requests accordingly.
1. For our log-generating test application pods, we found that increasing the IOPS to a minimum of 300
improved stability.
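
The `Mem_Buf_Limit` increase from the list above is applied in the Fluent Bit `tail` input.
A minimal sketch of the relevant fragment is shown below; it assumes the chart exposes the raw
Fluent Bit configuration under a `fluent-bit` key such as `rawConfig`, which may differ in your
chart version, and only the lines relevant to the buffer limit are included.

```yaml
fluent-bit:
  # Assumed key name; check how your chart version wires in the Fluent Bit configuration.
  rawConfig: |-
    [INPUT]
        Name              tail
        # Standard Kubernetes container log path, shown for illustration.
        Path              /var/log/containers/*.log
        # Raised from 5MB to 64MB per the recommendation above.
        Mem_Buf_Limit     64MB
```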

### Up to 500 application pods

Our test cluster had 71 nodes (50 `m4.large` instances, 21 `m5a.2xlarge` instances).

We used 125 GiB GP2 volumes, which allowed for IOPS of 375.
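
For context, AWS gp2 volumes provide a baseline of 3 IOPS per provisioned GiB,
which is where the 375 IOPS figure comes from and comfortably clears the 300 IOPS minimum
noted in the recommendations above:

```latex
125\,\text{GiB} \times 3\,\tfrac{\text{IOPS}}{\text{GiB}} = 375\,\text{IOPS} > 300\,\text{IOPS}
```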

This cluster ran an average of 500 application pods, each generating either 128KB/s logs or 2400 DPM metrics.
The application pods had a churn rate of about 20%.

Data type | Rate per pod | Min # pods | Max # pods | Max Total rate
--------- | ------------ | ---------- | ---------- | --------------
Logs | 128 KB/s | 50 pods | 400 pods | **4.3 TB/day**
Metrics | 2400 DPM | 100 pods | 450 pods | **1.3M DPM** (including non-application metrics)

We observed that **35 Fluentd-logs** pods and **25 Fluentd-metrics** pods were sufficient
for handling this load with the default resource limits.
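
If you prefer to pin replica counts to roughly match this observation rather than autoscale,
a sketch of the corresponding [`values.yaml`](./../helm/sumologic/values.yaml) settings is shown below;
the `replicaCount` keys are assumptions based on common chart layouts, so verify them for your version.

```yaml
fluentd:
  logs:
    statefulset:
      # 35 Fluentd-logs pods were sufficient for ~4.3 TB/day in this test.
      replicaCount: 35
  metrics:
    statefulset:
      # 25 Fluentd-metrics pods were sufficient for ~1.3M DPM in this test.
      replicaCount: 25
```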

Prometheus memory consumption reached a maximum of **28GiB**, with an average of **16GiB**.
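
Given that headroom, you may want to raise the Prometheus memory requests and limits explicitly.
One way to express this through the kube-prometheus-stack values is sketched below;
the amounts are illustrative and should be sized to your own observed peak usage.

```yaml
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      resources:
        requests:
          # Roughly the average observed above.
          memory: 16Gi
        limits:
          # Above the observed 28GiB maximum, to leave headroom.
          memory: 32Gi
```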

### Up to 2000 application pods

Our test cluster had 211 nodes (150 `m4.large` instances, 60 `m5a.2xlarge` instances,
1 `m5a.4xlarge` instance to accommodate the Prometheus pod's memory usage).

We used 125 GiB GP2 volumes, which allowed for IOPS of 375.

This cluster ran an average of 2000 application pods, each generating either 128KB/s logs or 2400 DPM metrics.
The application pods had a churn rate of about 10%.

Data type | Rate per pod | Min # pods | Max # pods | Max Total rate
--------- | ------------ | ---------- | ---------- | --------------
Logs | 128 KB/s | 900 pods | 1875 pods | **20 TB/day**
Metrics | 2400 DPM | 125 pods | 1100 pods | **3.2M DPM** (including non-application metrics)

We observed that **135 Fluentd-logs** pods and **100 Fluentd-metrics** pods were sufficient
for handling this load with the default resource limits.
Additionally, the **Fluentd-events** pod had to be given a 1 GiB memory limit
to accommodate the increased events load.
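
A sketch of how that memory increase can be expressed in [`values.yaml`](./../helm/sumologic/values.yaml)
is shown below; it assumes the events Fluentd exposes `statefulset.resources` like the other Fluentd
components, so confirm the exact keys for your chart version.

```yaml
fluentd:
  events:
    statefulset:
      resources:
        limits:
          # Raised to 1Gi to avoid OOMKills under the increased events load.
          memory: 1Gi
        requests:
          memory: 1Gi
```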

Prometheus memory consumption reached a maximum of **60GiB**, with an average of **45GiB**.
