DOC-13170 Product Change- PR #143536 - metric: add /metrics endpoint with static labels #19823
base: main
In monitoring-and-alerting.md, moved info in the existing Prometheus endpoint section to the new page. In self-hosted-deployments.json, added link to new page.
…nd-alerting.md %}#prometheus-endpoint) with ({% link {{ page.version.version }}/prometheus-endpoint.md %}). Replace instances of (#prometheus-endpoint) with ({% link {{ page.version.version }}/prometheus-endpoint.md %}).
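As a rough illustration of the second replacement described in this commit message, the following Python sketch swaps the in-page anchor for the Liquid link. The function name `relink` is hypothetical, and the sketch covers only the fully spelled-out `(#prometheus-endpoint)` form, not the truncated first form.

```python
# Hypothetical helper: rewrite in-page anchors to point at the new page.
# OLD and NEW are copied from the commit message; relink is an illustrative name.
OLD = "(#prometheus-endpoint)"
NEW = "({% link {{ page.version.version }}/prometheus-endpoint.md %})"


def relink(markdown: str) -> str:
    """Return the Markdown text with every OLD link target replaced by NEW."""
    return markdown.replace(OLD, NEW)


if __name__ == "__main__":
    print(relink("Refer to the [Prometheus endpoint](#prometheus-endpoint)."))
```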
The Prometheus endpoint doc looks great!
TFTR
LGTM! Just one question.
Looks good overall. I left some suggestions, and questions to potentially consider.
@@ -175,6 +175,10 @@ Cockroach Labs recommends that you avoid _increasing_ the period of time that DB

### Disable time-series storage

{{site.data.alerts.callout_info}}
Even if you rely on external tools for storing and visualizing your cluster's time-series metrics, CockroachDB continues to store time-series metrics for its [DB Console Metrics dashboards]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#metrics-dashboards). These stored time-series metrics may be used to generate a [tsdump]({% link {{ page.version.version }}/cockroach-debug-tsdump.md %}), which is critical during escalations to Cockroach Labs support.
- Even if you rely on external tools for storing and visualizing your cluster's time-series metrics, CockroachDB continues to store time-series metrics for its [DB Console Metrics dashboards]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#metrics-dashboards). These stored time-series metrics may be used to generate a [tsdump]({% link {{ page.version.version }}/cockroach-debug-tsdump.md %}), which is critical during escalations to Cockroach Labs support.
+ Even if you rely on external tools for storing and visualizing your cluster's time-series metrics, CockroachDB continues to store time-series metrics for its [DB Console Metrics dashboards]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#metrics-dashboards), unless you manually disable this collection. These stored time-series metrics may be used to generate a [tsdump]({% link {{ page.version.version }}/cockroach-debug-tsdump.md %}), which is critical during escalations to Cockroach Labs support.
Edited for clarity and to fit better in the context of potentially disabling. However, I'm not sure why this is a callout: is it because we want customers to know we're still collecting data, which could have a storage cost, or because you may not want to disable it, in order to preserve tsdump data that might be critical? (I am also not sure about saying it "is critical" vs., say, "may be critical".)
In the next paragraph, we seem to say the opposite; it's almost implied that it's not critical:
Disabling time-series storage is recommended only if you exclusively use a third-party tool such as [Prometheus]...
Do we mean to say that disabling time-series storage is an option if you exclusively use a third-party tool such as Prometheus, but even then, we recommend keeping it enabled in case it might help to provide it to CockroachDB Support during an issue?
(For that matter, why couldn't we just ask them to give us the data sourced from their third party tool; does it have less fidelity? Is that process/format less reliable?)
Just some food for thought to help inspire edits, or help ask SME/Support what they really care about and how they'd frame this.
In addition to using the exported time-series data to monitor a cluster through an external system, you can write alerting rules to ensure prompt notification of critical events or issues requiring intervention or investigation. Refer to [Essential Alerts]({% link {{ page.version.version }}/essential-alerts-self-hosted.md %}) for more details.
{{site.data.alerts.end}}

Even if you rely on external tools for storing and visualizing your cluster's time-series metrics, CockroachDB continues to store time-series metrics for its [DB Console Metrics dashboards]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#metrics-dashboards). These stored time-series metrics may be used to generate a [tsdump]({% link {{ page.version.version }}/cockroach-debug-tsdump.md %}), which is critical during escalations to Cockroach Labs support.
Should we add a mention that it's possible to limit or disable this, here (e.g. to limit storage) and link out to the other page?

### Static labels

Static labels allow segmentation of a metric across various facets for later querying and aggregation.
I saw the later phrase "Another common scenario", which led me to realize I didn't grasp the first scenario, so here's an attempt to characterize/introduce that first scenario.
- Static labels allow segmentation of a metric across various facets for later querying and aggregation.
+ Static labels allow segmentation of a metric across various facets for later querying and aggregation.
+ One common use of static labels is to support aggregation across related metric types. For example, rather than emitting separate metrics for inserts, selects, updates, and deletes, a single metric like `sql_count` can use a `query_type` label to distinguish among these operations. This enables operators to easily aggregate across query types (e.g., summing all SQL operations) or filter for a specific type using a label-based query.
+ The following tables contrast unlabeled metrics from the `_status/vars` endpoint with their labeled counterparts from the `metrics` endpoint:
Another common scenario occurs when each label value represents a disjoint set of categories. An example here is the various certificate expiration metrics, which differ only by the specific certificate they refer to. Operators are unlikely to aggregate these, but may still want to view all certificate expiration metrics on a dashboard.

For example, the output from the `metrics` endpoint will be similar to the following:
- Another common scenario occurs when each label value represents a disjoint set of categories. An example here is the various certificate expiration metrics, which differ only by the specific certificate they refer to. Operators are unlikely to aggregate these, but may still want to view all certificate expiration metrics on a dashboard.
- For example, the output from the `metrics` endpoint will be similar to the following:
+ In other cases, label values can represent distinct categories not meant to be aggregated. For example, certificate expiration metrics differ only by the specific certificate type they refer to. Operators are unlikely to sum or average these, but may still want to display them side by side on a dashboard for visibility.
+ In this case, a single metric name like `security_certificate_expiration` is reused, with the certificate type expressed as a label. The output from the `metrics` endpoint will be similar to the following:
security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="client-ca"} 0
security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="ui-ca"} 0
security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="node"} 1.840654953e+09
~~~
- ~~~
+ ~~~
+ This approach avoids a proliferation of metric names while allowing third-party tools to display each certificate's expiration as a separate line in a unified graph or table.
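To make the two labeling scenarios in this thread concrete, here is a minimal Python sketch that parses labeled samples and then either sums across a label or lists values per label. It is an illustration only: the `security_certificate_expiration` lines are copied from the diff above, while the `sql_count` sample values, the `node_id` label on those lines, and the helper names `parse_samples`, `sum_metric`, and `by_label` are assumptions. The parser is deliberately simplified and does not handle escaped quotes or braces inside label values.

```python
import re

# Sample lines in Prometheus text exposition style.
# The certificate lines come from the diff; the sql_count lines are hypothetical.
SAMPLE = """\
security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="client-ca"} 0
security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="ui-ca"} 0
security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="node"} 1.840654953e+09
sql_count{node_id="1",query_type="insert"} 12
sql_count{node_id="1",query_type="select"} 30
"""

# Simplified patterns: metric name, a {...} label set, and a numeric value.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match["labels"]))
        samples.append((match["name"], labels, float(match["value"])))
    return samples


def sum_metric(samples, name):
    """Scenario 1: aggregate one metric across all of its label values."""
    return sum(value for metric, _, value in samples if metric == name)


def by_label(samples, name, label):
    """Scenario 2: keep each label value separate, e.g. one row per certificate."""
    return {
        labels[label]: value
        for metric, labels, value in samples
        if metric == name and label in labels
    }


if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    print(sum_metric(parsed, "sql_count"))
    print(by_label(parsed, "security_certificate_expiration", "certificate_type"))
```

In a real deployment this aggregation would normally be done in the monitoring tool's query language rather than in a script; the sketch only shows why a single metric name with labels supports both summing and side-by-side display.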
Fixes DOC-13170
Rendered preview