diff --git a/docs/metrics/metrics-queries/index.md b/docs/metrics/metrics-queries/index.md
index 914915b175..de6e3fa29b 100644
--- a/docs/metrics/metrics-queries/index.md
+++ b/docs/metrics/metrics-queries/index.md
@@ -66,4 +66,10 @@ In this section, we'll introduce the following concepts:
Learn how to share a saved or unsaved metric query.
+Learn tips for getting the most out of your metric queries.
+
+
+This screenshot shows the rollup types of `min`, `max`, `latest`, `avg`, `sum`, and `count`:
+
+
+
+### Pod count quantize example
+
+Suppose you want to find the latest value for total pods in a cluster regardless of the run state. Each pod has 5 metric series (one for each possible pod state tag) with a value from 0 to *n* being the number of pods in that state:
+
+Note that when you sum the columns, the sum of `max` is 2, sum of `avg` would be 1.5, and the sum of `count` would be 10.
+
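The difference between the three rollups can be checked with a small Python sketch. The data points below are hypothetical (two data points per state series in one 1m bucket), chosen only so that the sums match the figures quoted above; they are not taken from a real cluster.

```python
from statistics import mean

# Hypothetical raw data points for the five pod-state series in a single
# 1m bucket (two data points per series). Values are illustrative only.
series = {
    "Pending": [0, 0],
    "Running": [1, 2],
    "Succeeded": [0, 0],
    "Failed": [0, 0],
    "Unknown": [0, 0],
}


def rollup_then_sum(rollup):
    """Apply a per-series rollup to each series, then sum across series."""
    return sum(rollup(points) for points in series.values())


sum_of_max = rollup_then_sum(max)    # peak value per series, then summed
sum_of_avg = rollup_then_sum(mean)   # average per series, then summed
sum_of_count = rollup_then_sum(len)  # number of data points, not pods

print(sum_of_max, sum_of_avg, sum_of_count)
```

Only the `max` rollup reflects the number of pods seen in the bucket; `avg` understates it and `count` only measures how many data points arrived.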
+The correct query takes the `max` of each series and then sums the results:
+```
+cluster=prod metric=kube_pod_status_phase
+| quantize to 1m using max drop last | sum
+```
+
+
+The wrong approach is to average the rollup and count of metric series:
+
+
+
+### Changing statistic type on a chart changes results
+
+Changing the **Statistic Type** on a chart (if that option is present) changes the output of the query that is displayed. Always consider whether the default of **Average** is correct:
+
+
+
+For example, selecting the **Average** statistic type for the following query yields a sum of `75.63`:
+
+But selecting the **Sum** statistic type for the same query yields a sum of `1,739.5`:
+
+## Metric discovery
+
+### Autocomplete
+
+Metric and tag discovery is a key activity in creating metric queries. [Metrics Search](/docs/metrics/metrics-queries/metrics-explorer/) provides an autocomplete function to make it fast and easy to build out metric queries. Tag names and values are supplied in the query UI based on the current metric query scope.
+
+Autocomplete can show you:
+* Which metrics exist for a specific scope:
+* What tag names exist for a specific metric or tag scope:
+* What values exist for a specific tag in the current scope:
+
+### Time series
+
+The **Time Series** view lets you review the metric time series and tag values when the query outputs raw metric series (with no aggregation). Use this view to understand how many metric series are in the current scope, check how many values appear in tag columns, or investigate higher-than-expected cardinality in tags.
+
+Switch to the **Time Series** tab to see metrics, tags and tag values:
+* Use `//` to comment out the aggregate part of the query and jump back to the raw time series view.
+* Columns are sortable.
+* Select the ellipsis button on any tag value in the grid to quickly add filter statements.
+
+
+
+### Time series example
+
+In the following example, we use `//` to comment out the aggregation so that the query output stays raw (no aggregation) and can be viewed in the **Time Series** tab:
+
+```
+container="istio-proxy" node="ip-10-42-169-62.us-west-2.compute.internal" metric=container_memory_working_set_bytes cluster=prod namespace=prod-otel001
+| quantize 1m // | avg by container, pod | sum by pod
+```
+
+
+Notice the following:
+* The `container`, `namespace`, and `pod` labels for the metric come from Kubernetes (`https://kubernetes.io/docs/reference/instrumentation/metrics/`).
+* There is one metric in scope for `cluster` and `node`.
+* There are 17 time series, so one or more tags must have unique values.
+* The `deployment` and `pod` columns both have 17 values (one per istio endpoint).
+* Use `avg` quantization (default), then `sum` for final total.
+* Don't use `sum` (total for all data points) or `count` (count of data points only).
+
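As a rough illustration of why the aggregation order matters, the following Python sketch mimics `avg by container, pod` followed by `sum by pod` on one time bucket. The pod names, container names, and byte values are hypothetical and exist only to show the two stages.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical quantized values (bytes) for one time bucket, as
# (pod, container, value). Several series can share a pod and container.
points = [
    ("pod-a", "istio-proxy", 100.0),
    ("pod-a", "istio-proxy", 120.0),
    ("pod-a", "app", 300.0),
    ("pod-b", "istio-proxy", 90.0),
]


def avg_by_container_pod(rows):
    """Stage 1: average all series that share the same (pod, container)."""
    groups = defaultdict(list)
    for pod, container, value in rows:
        groups[(pod, container)].append(value)
    return {key: mean(values) for key, values in groups.items()}


def sum_by_pod(per_container):
    """Stage 2: add up the per-container averages for each pod."""
    totals = defaultdict(float)
    for (pod, _container), value in per_container.items():
        totals[pod] += value
    return dict(totals)


memory_by_pod = sum_by_pod(avg_by_container_pod(points))
naive_sum_by_pod = {"pod-a": 100.0 + 120.0 + 300.0, "pod-b": 90.0}

print(memory_by_pod)
print(naive_sum_by_pod)
```

Summing every data point directly double counts the duplicated `istio-proxy` series for `pod-a`, which is the kind of inflated total that averaging first avoids.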
+## Charting metrics
+
+### Metric chart types
+
+Unlike log searches, you often don't need to format the query output to make different types of charts. Some panels, like **Single Value**, do need aggregate output formatted a certain way.
+
+For metrics, the UI has a very large impact on the resulting chart (compared to log search charting). The same query can produce many types of charts even when not aggregated (unlike in logs).
+
+### Aggregate in metrics
+
+Good reasons to aggregate in metrics:
+* Better control over resulting time series from query.
+* Much easier to get the query output you want to chart.
+* Queries will scale to larger numbers of series and data points.
+
+You can't have more than 1000 groups in aggregation or raw data series. This is called the "output limit". For more information, see [Output data limit](/docs/metrics/metrics-queries/metric-query-error-messages/#output-data-limit).
+
+Try the following fixes:
+* Perform a final aggregation by a dimension with fewer than 1000 groups:
+
+* Group or sort can be a UI setting in some charts:
+* Threshold chart coloring can be a UI setting:
+
+### Overrides to improve series naming
+
+Use the **Display Overrides** tab to create easy-to-read aliases for metric series names in legends and popups, as well as to set the more usual custom chart formatting options, such as the left/right axis. For more information, see [Override dashboard displays](/docs/dashboards/panels/modify-chart/#overridedashboard-displays).
+
+Following are examples of queries with and without override.
+
+#### Default raw series
+
+Query:
+ ```
+ metric=kube_pod_container_status_restarts_total
+ | quantize using max | delta counter
+ | topk(100,max)
+ ```
+
+Override: None
+
+Output:
+
+
+#### Aggregated
+
+Query:
+ ```
+ metric=kube_pod_container_status_restarts_total
+ | quantize using max | delta counter
+ | topk(100,max) | sum by pod,cluster,namespace
+ ```
+Override: None
+
+Output:
+
+
+#### Aggregated with override alias on #A
+
+Now see how an override makes it easier to read output.
+
+Query:
+ ```
+ metric=kube_pod_container_status_restarts_total
+ | quantize using max | delta counter
+ | topk(100,max)
+ | sum by pod,cluster,namespace
+ ```
+
+Override:
+
+
+Output:
+
+
+### Tabular stats for time series panels
+
+Use the **Bottom** or **Right** positions with table format and include aggregates like `sum`, `avg`, and `latest` to enable summary stats in the same panel for time series charts.
+
+
+
+## Learn your ABC
+
+### ABC pattern
+
+The ABC pattern computes a third series, C, from two series, A and B. Common examples are:
+* Calculating % usage (CPU, memory, or disk full %) where the capacity and usage metrics are sent separately.
+* Determine usage in Kubernetes metrics sent as multiple series such as requests versus limits for pods.
+
+When using the ABC pattern, keep in mind:
+* If the calculation is per entity (`per x`), group each series accordingly, for example, `sum by pod`. Take careful note of the quantization period and type.
+* Create a #C series with the required computation.
+* Make sure the quantization period is identical for all three series; otherwise the data points will not line up and the results will be misleading.
+* If grouping, you must include [`along`](/docs/metrics/metrics-operators/along/) in #C. For an example, see [Join Metrics Queries](/docs/metrics/introduction/joins/).
+
+### ABC example 1 - Disk usage % top 20
+
+Say we have two metrics and want to calculate a usage % for the top 20 fullest disk volumes:
+
+```
+// metric gives us bytes size per node, mount, device
+node="ip-10-42-177-158.us-west-2.compute.internal" cluster=prod metric=node_filesystem_size_bytes !fstype=tmpfs node=*
+| quantize to 5m using avg
+| avg by node,mountpoint,device
+```
+
+```
+// metric gives us the free bytes per node, mountpoint, device
+node="ip-10-42-177-158.us-west-2.compute.internal" cluster=prod metric=node_filesystem_avail_bytes !fstype=tmpfs node=*
+| quantize to 5m using avg
+| avg by node,mountpoint,device
+```
+
+Create an #A and #B series with the queries above and then add a #C series:
+```
+1 - (#B / #A) along node,mountpoint,device | eval _value * 100 | quantize to 5m using avg
+| topk(20, latest)
+```
+
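The #C computation above can be sketched in Python as a join on the shared grouping key followed by a top-k selection. This is only an illustration of the arithmetic, not how the query engine is implemented; the node names, devices, and byte values are hypothetical.

```python
# Hypothetical latest values keyed by (node, mountpoint, device).
size_bytes = {  # series #A: total size
    ("node-1", "/", "/dev/sda1"): 100e9,
    ("node-1", "/data", "/dev/sdb1"): 500e9,
    ("node-2", "/", "/dev/sda1"): 100e9,
}
avail_bytes = {  # series #B: available bytes
    ("node-1", "/", "/dev/sda1"): 25e9,
    ("node-1", "/data", "/dev/sdb1"): 400e9,
    ("node-2", "/", "/dev/sda1"): 90e9,
}


def used_percent(size, avail):
    """Join #A and #B on their shared key and compute (1 - B / A) * 100."""
    return {
        key: (1 - avail[key] / size[key]) * 100
        for key in size.keys() & avail.keys()  # only keys present in both
        if size[key] > 0                       # skip zero-size volumes
    }


def top_k(values, k):
    """Return the k entries with the largest values, highest first."""
    return sorted(values.items(), key=lambda item: item[1], reverse=True)[:k]


fullest = top_k(used_percent(size_bytes, avail_bytes), 2)
for key, percent in fullest:
    print(key, round(percent, 1))
```

The key used for the join plays the role of `along node,mountpoint,device`: without a shared key, there is no way to know which #B value belongs to which #A value.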
+
+
+For our top list output we've chosen a bar chart, with only #C visible, with sort by value descending, and latest value shown. Note that `along` is very important:
+
+### ABC example 2 - Kubernetes memory versus limits by pod
+
+A query:
+```
+metric=container_memory_working_set_bytes
+cluster=prod namespace=prod-otel001
+ | quantize 1m | avg by container, pod | sum by pod
+```
+
+B query:
+```
+metric=kube_pod_container_resource_limits resource=memory
+cluster=prod namespace=prod-otel001
+| quantize 1m | sum by pod
+```
+
+C query:
+```
+#A / #B * 100 along pod
+| topk(50,max)
+```
+
+
+
+
+
+## Rates and counters
+
+Many metric values are sent as a cumulative counter, for example, `kube_pod_container_status_restarts_total`, or need to be graphed as a rate over time such as requests/sec.
+
+To graph these metrics, we typically want to calculate the rate of change and sum it in each time window, using one of several possible syntaxes. The [`delta` operator](/docs/metrics/metrics-operators/delta/) is often the simplest to use, but there are several options that give similar results:
+* Use [`quantize`](/docs/metrics/metrics-operators/quantize/) to calculate rates/time: `| quantize using rate`
+* Use [`delta`](/docs/metrics/metrics-operators/delta/) for increasing counters, for example, total restarts: `| delta counter`
+* Use [`rate`](/docs/metrics/metrics-operators/rate/) for increasing counter rate/time: `| rate increasing over 1m`.
+* Use [`sum`](/docs/metrics/metrics-operators/sum/) to turn value over X bucket into rate/time: `| sum | eval 1000 * _value / _granularity`
+
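The arithmetic behind these options can be sketched in Python. Two assumptions are made here and are not confirmed by this page: a drop in a counter value is treated as a reset, with the post-reset value taken as the increase for that interval, and the granularity is expressed in milliseconds (which is what the `1000 * _value / _granularity` expression suggests). The sample values are hypothetical.

```python
def counter_deltas(samples):
    """Per-interval increase of a cumulative counter.

    Assumption: a drop in value means the counter was reset, and the
    post-reset value is used as the increase for that interval.
    """
    deltas = []
    for prev, curr in zip(samples, samples[1:]):
        deltas.append(curr - prev if curr >= prev else curr)
    return deltas


def per_second(bucket_total, granularity_ms):
    """Convert a per-bucket total into a per-second rate.

    Assumption: granularity is given in milliseconds.
    """
    return 1000 * bucket_total / granularity_ms


restarts = [3, 3, 4, 6, 1, 2]  # hypothetical max-per-bucket counter values
deltas = counter_deltas(restarts)
rate = per_second(120, 60_000)  # 120 requests in a 1m bucket

print(deltas, sum(deltas), rate)
```

Charting the cumulative values directly would show an ever-growing line; the deltas show how many restarts happened in each bucket.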
+### Counter example
+
+Following is a counter example showing pod restarts using the `kube_pod_container_status_restarts_total` metric:
+
+```
+metric=kube_pod_container_status_restarts_total
+| quantize using max | delta counter
+| topk(100,max)
+| sum by pod
+```
+
+Note that:
+* Use `quantize` with `max`, not `avg` (the default), or you will get odd fractional results.
+* `delta` is the easiest operator for measuring counter change. `rate` will produce fractions rather than integers.
+* `sum` the delta of the `max` rollup, and use a stacked bar chart for an insightful view.
+
+Following is the correct way to structure your query to show pod restarts. Use `max` with `quantize`, the `delta` operator in counter mode, and `topk` to show only the 100 series with the most restarts. The stacked chart clearly shows the restart rate by pod.
+
+Following is an incorrect way to query for pod restarts. It uses `sum` over time for the cumulative total of restarts per container.
+
+## Managing DPM and high cardinality
+
+### Understanding DPM drivers
+
+Metrics are billed in data points per minute (DPM), typically at a rate of 3 credits per 1000 DPM averaged over each 24-hour period. Roughly, `DPM = metrics * entities * total cardinality`, where total cardinality covers all tag names and values, so sending more metric/tag combinations increases cost.
+
+We have three tools in Sumo Logic to track DPM usage and drill down into drivers of high metric cardinality:
+* The account page shows metric consumption per day over time:
+* The **Metrics** dashboard in the [Data Volume app](/docs/integrations/sumo-apps/data-volume/) can track high DPM consumption per metadata fields such as `collector`, `source`, and `sourcecategory`:
+* [Metrics Data Ingestion](/docs/metrics/metrics-dpm/) is a filterable admin UI that shows detailed DPM and cardinality per metric name or tag. Advanced users can run custom log searches against the underlying audit indexes for this source.
+
+### Metric DPM versus credits
+
+More data points require more credits, and therefore, more cost.
+
+Key factors that result in higher DPM are:
+* [Number of hosts, sources, or entities](#number-of-hosts-sources-or-entities)
+* [Increase in cardinality](#increase-in-cardinality)
+* [Increase in number of metrics](#increase-in-number-of-metrics)
+* [Frequency of data points](#frequency-of-data-points)
+
+Excessively high cardinality for a metric or single tag might trigger rate limiting or blocking of that metric. See [Disabled Metrics Sources](/docs/metrics/manage-metric-volume/disabled-metrics-sources/).
+
+#### Number of hosts, sources, or entities
+
+Having more hosts, sources, or entities sending metrics means more DPM. For example, if you send 50 metrics per host and have 1000 hosts, that will be 50,000 DPM.
+
+In some use cases you can filter out unwanted hosts. For example, you could filter by Kubernetes namespace or AWS tag to reduce the total number of metric entities.
+
+#### Increase in cardinality
+
+Sending dimension (tag) values with high cardinality will result in high data points per minute for that metric. High cardinality can also make analyzing and interpreting metrics difficult.
+
+The cardinality (and hence DPM) of a single metric is the product of all possible cardinalities of tags. For example:
+
+* host x10 and service x20 = up to 200 DPM
+* host x10 and service x20 and url_path x5000 = up to 1,000,000 DPM
+* host x10 and service x20 and thread_id x1000s and customer id x2M customers = millions (very likely to get rate limited)
+* host x10 and service x20 and an epoch timestamp tag such as `1737579476122` = every data point becomes a unique time series, producing millions (very likely to get rate limited)
+
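The worst-case number of time series for one metric is simply the product of the tag cardinalities, which a few lines of Python can confirm. The tag names and counts are the ones from the list above; the helper name `max_series` is made up for this sketch.

```python
from math import prod


def max_series(tag_cardinalities):
    """Upper bound on time series for one metric: product of tag cardinalities."""
    return prod(tag_cardinalities.values())


low = max_series({"host": 10, "service": 20})
with_paths = max_series({"host": 10, "service": 20, "url_path": 5000})

print(low, with_paths)
```

At one data point per minute per series, these upper bounds translate directly into DPM, which is why a single high-cardinality tag dominates the total.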
+Ideally, tag value cardinality should be low (less than 100). If you need very high cardinality, logs are usually a better option. The log query engine is much more flexible and can handle very high cardinality by design.
+
+Never send tag values with high cardinality, where a tag is likely to have more than a few thousand values, such as:
+* Timestamps or epoch times
+* User or customer IDs or usernames (where there could be more than 50k possible values, possibly millions)
+* Query strings, for example, on a search form where any search text is a tag
+* URLs or paths with high cardinality, such as unique per-customer URLs (for example, `http://foo.bar/473639029/account`)
+
+At best, metrics with 10k or more tag values will be expensive to ingest and hard to query, and at worst, they will be blocked.
+
+#### Increase in number of metrics
+
+If you see an increase in the number of metric series, look for newly onboarded metric sources and consider filtering the metrics sent. Most metrics pipelines offer configuration options to include or exclude metrics rather than send all of them. For example:
+* [Metrics filtering in a Kubernetes collection](https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/docs/filtering.md#metrics)
+* [Using Telegraf filtering such as `fieldpass`](https://github.com/rjury-sumo/sumo-telegraf-examples/blob/main/config/linux.telegraf.conf) to send only valuable metrics, not all metrics in the plugin (default).
+
+#### Frequency of data points
+
+Sending one data point per minute will make one DPM per entity x tag cardinality. But sending four data points per minute (every 15s) will make four DPM per entity x tag cardinality. Check the default frequency of metric sources and reduce the frequency to reduce DPM.
+
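A back-of-the-envelope estimate of how reporting interval and entity count drive DPM can be written as a short Python sketch. The function name `dpm` is made up for illustration, and the model assumes every series reports at the same fixed interval.

```python
def dpm(series_count, interval_seconds):
    """Data points per minute for series that all report at a fixed interval."""
    return series_count * 60 / interval_seconds


one_per_minute = dpm(1, 60)       # one series, one data point per minute
every_15s = dpm(1, 15)            # same series, reported every 15 seconds
fleet = dpm(50 * 1000, 60)        # 50 metrics on each of 1000 hosts, 1m interval
fleet_fast = dpm(50 * 1000, 15)   # same fleet reporting every 15 seconds

print(one_per_minute, every_15s, fleet, fleet_fast)
```

Moving the same fleet from a 15s to a 1m interval cuts the estimated DPM by a factor of four without dropping any metric or host.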
+A common use case is changing the scrape interval to 1m or 2m in a Kubernetes collection. Steps for Kubernetes collection v4 (OpenTelemetry) and v3 (Prometheus) can be found [here](/docs/send-data/kubernetes/best-practices/#changing-scrape-interval-for-opentelemetry-metrics-collection).
diff --git a/sidebars.ts b/sidebars.ts
index aa52efc5b5..83111c0433 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -1710,6 +1710,7 @@ module.exports = {
'metrics/metrics-queries/aggregation-tips',
'metrics/metrics-queries/metric-query-error-messages',
'metrics/metrics-queries/share-metric-query',
+ 'metrics/metrics-queries/metric-query-best-practices',
],
},
{
diff --git a/static/img/metrics/metric-query-abc-example-1.png b/static/img/metrics/metric-query-abc-example-1.png
new file mode 100644
index 0000000000..0ab9286805
Binary files /dev/null and b/static/img/metrics/metric-query-abc-example-1.png differ
diff --git a/static/img/metrics/metric-query-abc-example-1a.png b/static/img/metrics/metric-query-abc-example-1a.png
new file mode 100644
index 0000000000..e86c1528f3
Binary files /dev/null and b/static/img/metrics/metric-query-abc-example-1a.png differ
diff --git a/static/img/metrics/metric-query-abc-example-2.png b/static/img/metrics/metric-query-abc-example-2.png
new file mode 100644
index 0000000000..b7920663e0
Binary files /dev/null and b/static/img/metrics/metric-query-abc-example-2.png differ
diff --git a/static/img/metrics/metric-query-abc-example-2a.png b/static/img/metrics/metric-query-abc-example-2a.png
new file mode 100644
index 0000000000..c298562270
Binary files /dev/null and b/static/img/metrics/metric-query-abc-example-2a.png differ
diff --git a/static/img/metrics/metric-query-aggregated-with-override.png b/static/img/metrics/metric-query-aggregated-with-override.png
new file mode 100644
index 0000000000..a47a5d9ee0
Binary files /dev/null and b/static/img/metrics/metric-query-aggregated-with-override.png differ
diff --git a/static/img/metrics/metric-query-aggregated.png b/static/img/metrics/metric-query-aggregated.png
new file mode 100644
index 0000000000..1e04660339
Binary files /dev/null and b/static/img/metrics/metric-query-aggregated.png differ
diff --git a/static/img/metrics/metric-query-chart-tips.png b/static/img/metrics/metric-query-chart-tips.png
new file mode 100644
index 0000000000..71292fd73d
Binary files /dev/null and b/static/img/metrics/metric-query-chart-tips.png differ
diff --git a/static/img/metrics/metric-query-correct-pod-restart.png b/static/img/metrics/metric-query-correct-pod-restart.png
new file mode 100644
index 0000000000..c2e4971c83
Binary files /dev/null and b/static/img/metrics/metric-query-correct-pod-restart.png differ
diff --git a/static/img/metrics/metric-query-default-raw.png b/static/img/metrics/metric-query-default-raw.png
new file mode 100644
index 0000000000..a751ea75bc
Binary files /dev/null and b/static/img/metrics/metric-query-default-raw.png differ
diff --git a/static/img/metrics/metric-query-display-override.png b/static/img/metrics/metric-query-display-override.png
new file mode 100644
index 0000000000..bffa725291
Binary files /dev/null and b/static/img/metrics/metric-query-display-override.png differ
diff --git a/static/img/metrics/metric-query-dpm-1.png b/static/img/metrics/metric-query-dpm-1.png
new file mode 100644
index 0000000000..4d6cbd7fff
Binary files /dev/null and b/static/img/metrics/metric-query-dpm-1.png differ
diff --git a/static/img/metrics/metric-query-dpm-2.png b/static/img/metrics/metric-query-dpm-2.png
new file mode 100644
index 0000000000..7aa009cb26
Binary files /dev/null and b/static/img/metrics/metric-query-dpm-2.png differ
diff --git a/static/img/metrics/metric-query-dpm-2a.png b/static/img/metrics/metric-query-dpm-2a.png
new file mode 100644
index 0000000000..abc1bb99a5
Binary files /dev/null and b/static/img/metrics/metric-query-dpm-2a.png differ
diff --git a/static/img/metrics/metric-query-dpm-3.png b/static/img/metrics/metric-query-dpm-3.png
new file mode 100644
index 0000000000..2e7bcc7c0f
Binary files /dev/null and b/static/img/metrics/metric-query-dpm-3.png differ
diff --git a/static/img/metrics/metric-query-dpm-3a.png b/static/img/metrics/metric-query-dpm-3a.png
new file mode 100644
index 0000000000..854aacbff1
Binary files /dev/null and b/static/img/metrics/metric-query-dpm-3a.png differ
diff --git a/static/img/metrics/metric-query-group-sort.png b/static/img/metrics/metric-query-group-sort.png
new file mode 100644
index 0000000000..f64c1ea589
Binary files /dev/null and b/static/img/metrics/metric-query-group-sort.png differ
diff --git a/static/img/metrics/metric-query-table-display.png b/static/img/metrics/metric-query-table-display.png
new file mode 100644
index 0000000000..bd8739d9fb
Binary files /dev/null and b/static/img/metrics/metric-query-table-display.png differ
diff --git a/static/img/metrics/metric-query-threshold.png b/static/img/metrics/metric-query-threshold.png
new file mode 100644
index 0000000000..b0ee1d5d2d
Binary files /dev/null and b/static/img/metrics/metric-query-threshold.png differ
diff --git a/static/img/metrics/metric-query-time-series.png b/static/img/metrics/metric-query-time-series.png
new file mode 100644
index 0000000000..7ef33af2d9
Binary files /dev/null and b/static/img/metrics/metric-query-time-series.png differ
diff --git a/static/img/metrics/metric-query-what-values.png b/static/img/metrics/metric-query-what-values.png
new file mode 100644
index 0000000000..89549a9ee0
Binary files /dev/null and b/static/img/metrics/metric-query-what-values.png differ
diff --git a/static/img/metrics/metric-query-which-metrics.png b/static/img/metrics/metric-query-which-metrics.png
new file mode 100644
index 0000000000..86438ceeb1
Binary files /dev/null and b/static/img/metrics/metric-query-which-metrics.png differ
diff --git a/static/img/metrics/metric-query-which-tag-names.png b/static/img/metrics/metric-query-which-tag-names.png
new file mode 100644
index 0000000000..3457485980
Binary files /dev/null and b/static/img/metrics/metric-query-which-tag-names.png differ
diff --git a/static/img/metrics/metric-query-wrong-pod-restart.png b/static/img/metrics/metric-query-wrong-pod-restart.png
new file mode 100644
index 0000000000..09e7dae5a0
Binary files /dev/null and b/static/img/metrics/metric-query-wrong-pod-restart.png differ
diff --git a/static/img/metrics/metrics-count-pods-correct.png b/static/img/metrics/metrics-count-pods-correct.png
new file mode 100644
index 0000000000..75dd0a3463
Binary files /dev/null and b/static/img/metrics/metrics-count-pods-correct.png differ
diff --git a/static/img/metrics/metrics-count-pods-wrong.png b/static/img/metrics/metrics-count-pods-wrong.png
new file mode 100644
index 0000000000..2731104c48
Binary files /dev/null and b/static/img/metrics/metrics-count-pods-wrong.png differ
diff --git a/static/img/metrics/metrics-quantize-statistic-type.png b/static/img/metrics/metrics-quantize-statistic-type.png
new file mode 100644
index 0000000000..6851cd623c
Binary files /dev/null and b/static/img/metrics/metrics-quantize-statistic-type.png differ
diff --git a/static/img/metrics/metrics-query-average-statistics-type.png b/static/img/metrics/metrics-query-average-statistics-type.png
new file mode 100644
index 0000000000..94b876a400
Binary files /dev/null and b/static/img/metrics/metrics-query-average-statistics-type.png differ
diff --git a/static/img/metrics/metrics-query-quantize-example.png b/static/img/metrics/metrics-query-quantize-example.png
new file mode 100644
index 0000000000..eed4543455
Binary files /dev/null and b/static/img/metrics/metrics-query-quantize-example.png differ
diff --git a/static/img/metrics/metrics-query-quantize-results.png b/static/img/metrics/metrics-query-quantize-results.png
new file mode 100644
index 0000000000..144708fe16
Binary files /dev/null and b/static/img/metrics/metrics-query-quantize-results.png differ
diff --git a/static/img/metrics/metrics-query-sum-statistic-type.png b/static/img/metrics/metrics-query-sum-statistic-type.png
new file mode 100644
index 0000000000..1cce364dfa
Binary files /dev/null and b/static/img/metrics/metrics-query-sum-statistic-type.png differ
diff --git a/static/img/metrics/time-series-example.png b/static/img/metrics/time-series-example.png
new file mode 100644
index 0000000000..d087b42b59
Binary files /dev/null and b/static/img/metrics/time-series-example.png differ
diff --git a/static/img/metrics/timeseries-error-message.png b/static/img/metrics/timeseries-error-message.png
new file mode 100644
index 0000000000..bb959bbdc1
Binary files /dev/null and b/static/img/metrics/timeseries-error-message.png differ