DOC-13170 Product Change- PR #143536 - metric: add /metrics endpoint with static labels #19823

Open
wants to merge 11 commits into
base: main

florence-crl
Contributor

@florence-crl florence-crl commented Jun 23, 2025

Fixes DOC-13170

  • Added prometheus-endpoint.md with info about metrics endpoint.
  • In monitoring-and-alerting.md, moved info in the existing Prometheus endpoint section to the new prometheus-endpoint page.
  • In self-hosted-deployments.json, added link to new page.
  • Replaced instances of ({% link {{ page.version.version }}/monitoring-and-alerting.md %}#prometheus-endpoint) and (#prometheus-endpoint) with ({% link {{ page.version.version }}/prometheus-endpoint.md %}).
  • Replaced instances of status/vars with Prometheus endpoint.
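
One way to picture the link replacement described in the list above is as a plain string substitution over each Markdown file. The following Python sketch is an illustration only and not necessarily how the change was made in this PR. The function name `retarget_links` and the constant names are hypothetical; the link strings are copied from the PR description.

```python
# Link targets named in the PR description that should point at the new page.
OLD_TARGETS = (
    "({% link {{ page.version.version }}/monitoring-and-alerting.md %}#prometheus-endpoint)",
    "(#prometheus-endpoint)",
)
NEW_TARGET = "({% link {{ page.version.version }}/prometheus-endpoint.md %})"


def retarget_links(markdown: str) -> str:
    """Return `markdown` with every old link target replaced by the new one.

    Hypothetical helper: a literal substitution, with no Markdown parsing.
    """
    for old in OLD_TARGETS:
        markdown = markdown.replace(old, NEW_TARGET)
    return markdown


before = "Refer to the [Prometheus endpoint](#prometheus-endpoint)."
after = retarget_links(before)
```

Because the substitution is literal, it leaves untouched any link whose target differs from the two strings above, so other anchors on the monitoring page are unaffected.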

Rendered preview

In monitoring-and-alerting.md, moved info in the existing Prometheus endpoint section to the new page.

In self-hosted-deployments.json, added link to new page.

netlify bot commented Jun 23, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | e75663b |
| 🔍 Latest deploy log | https://app.netlify.com/projects/cockroachdb-interactivetutorials-docs/deploys/686d66335c9dad00084ab62b |


github-actions bot commented Jun 23, 2025


netlify bot commented Jun 23, 2025

Deploy Preview for cockroachdb-api-docs canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | e75663b |
| 🔍 Latest deploy log | https://app.netlify.com/projects/cockroachdb-api-docs/deploys/686d6633637a5f0008815f9c |


netlify bot commented Jun 23, 2025

Netlify Preview

| Name | Link |
| --- | --- |
| 🔨 Latest commit | e75663b |
| 🔍 Latest deploy log | https://app.netlify.com/projects/cockroachdb-docs/deploys/686d6633bf3bad0008cfb49c |
| 😎 Deploy Preview | https://deploy-preview-19823--cockroachdb-docs.netlify.app |

…nd-alerting.md %}#prometheus-endpoint) with ({% link {{ page.version.version }}/prometheus-endpoint.md %}).

Replace instances of (#prometheus-endpoint) with ({% link {{ page.version.version }}/prometheus-endpoint.md %}).

@kevin-v-ngo kevin-v-ngo left a comment

The Prometheus endpoint doc looks great!

Contributor Author

@florence-crl florence-crl left a comment

TFTR

Contributor

@dhartunian dhartunian left a comment

LGTM! Just one question.

@florence-crl florence-crl requested a review from mikeCRL July 8, 2025 18:45
Contributor

@mikeCRL mikeCRL left a comment

Looks good overall. I left some suggestions and a few questions to consider.

@@ -175,6 +175,10 @@ Cockroach Labs recommends that you avoid _increasing_ the period of time that DB

### Disable time-series storage

{{site.data.alerts.callout_info}}
Even if you rely on external tools for storing and visualizing your cluster's time-series metrics, CockroachDB continues to store time-series metrics for its [DB Console Metrics dashboards]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#metrics-dashboards). These stored time-series metrics may be used to generate a [tsdump]({% link {{ page.version.version }}/cockroach-debug-tsdump.md %}), which is critical during escalations to Cockroach Labs support.
Contributor

Suggested change
- Even if you rely on external tools for storing and visualizing your cluster's time-series metrics, CockroachDB continues to store time-series metrics for its [DB Console Metrics dashboards]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#metrics-dashboards). These stored time-series metrics may be used to generate a [tsdump]({% link {{ page.version.version }}/cockroach-debug-tsdump.md %}), which is critical during escalations to Cockroach Labs support.
+ Even if you rely on external tools for storing and visualizing your cluster's time-series metrics, CockroachDB continues to store time-series metrics for its [DB Console Metrics dashboards]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#metrics-dashboards), unless you manually disable this collection. These stored time-series metrics may be used to generate a [tsdump]({% link {{ page.version.version }}/cockroach-debug-tsdump.md %}), which is critical during escalations to Cockroach Labs support.

Edited for clarity and to fit better in the context of potentially disabling time-series storage. However, I'm not sure why this is a callout. Is it because we want customers to know we're still collecting data, which could have a storage cost? Or because you may not want to disable it, to preserve tsdump data that might be critical? (I am also not sure about saying it "is critical" vs., say, "may be critical".)

In the next paragraph, we seem to say the opposite; it's almost implied that it's not critical:

> Disabling time-series storage is recommended only if you exclusively use a third-party tool such as [Prometheus]...

Do we mean to say that disabling time-series storage is an option if you exclusively use a third-party tool such as Prometheus, but that even then we recommend keeping it enabled, in case the stored data would help CockroachDB Support during an issue?

(For that matter, why couldn't we just ask them to give us the data from their third-party tool? Does it have less fidelity? Is that process or format less reliable?)

Just some food for thought to help inspire edits, or help ask SME/Support what they really care about and how they'd frame this.

In addition to using the exported time-series data to monitor a cluster through an external system, you can write alerting rules to ensure prompt notification of critical events or issues requiring intervention or investigation. Refer to [Essential Alerts]({% link {{ page.version.version }}/essential-alerts-self-hosted.md %}) for more details.
{{site.data.alerts.end}}

Even if you rely on external tools for storing and visualizing your cluster's time-series metrics, CockroachDB continues to store time-series metrics for its [DB Console Metrics dashboards]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#metrics-dashboards). These stored time-series metrics may be used to generate a [tsdump]({% link {{ page.version.version }}/cockroach-debug-tsdump.md %}), which is critical during escalations to Cockroach Labs support.
Contributor

Should we mention here that it's possible to limit or disable this (e.g., to limit storage), and link out to the other page?


### Static labels

Static labels allow segmentation of a metric across various facets for later querying and aggregation.
Contributor

I saw the later phrase "Another common scenario", which led me to realize I didn't grasp the first scenario, so here's an attempt to characterize/introduce that first scenario.

Suggested change
- Static labels allow segmentation of a metric across various facets for later querying and aggregation.
+ Static labels allow segmentation of a metric across various facets for later querying and aggregation.
+ One common use of static labels is to support aggregation across related metric types. For example, rather than emitting separate metrics for inserts, selects, updates, and deletes, a single metric like `sql_count` can use a `query_type` label to distinguish among these operations. This enables operators to easily aggregate across query types (e.g., summing all SQL operations) or filter for a specific type using a label-based query.
+ The following tables contrast unlabeled metrics from the `_status/vars` endpoint with their labeled counterparts from the `metrics` endpoint:

Comment on lines +133 to +135
Another common scenario occurs when each label value represents a disjoint set of categories. An example here is the various certificate expiration metrics, which differ only by the specific certificate they refer to. Operators are unlikely to aggregate these, but may still want to view all certificate expiration metrics on a dashboard.

For example, the output from the `metrics` endpoint will be similar to the following:
Contributor

Suggested change
- Another common scenario occurs when each label value represents a disjoint set of categories. An example here is the various certificate expiration metrics, which differ only by the specific certificate they refer to. Operators are unlikely to aggregate these, but may still want to view all certificate expiration metrics on a dashboard.
- For example, the output from the `metrics` endpoint will be similar to the following:
+ In other cases, label values can represent distinct categories not meant to be aggregated. For example, certificate expiration metrics differ only by the specific certificate type they refer to. Operators are unlikely to sum or average these, but may still want to display them side by side on a dashboard for visibility.
+ In this case, a single metric name like `security_certificate_expiration` is reused, with the certificate type expressed as a label. The output from the `metrics` endpoint will be similar to the following:

security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="client-ca"} 0
security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="ui-ca"} 0
security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="node"} 1.840654953e+09
~~~
Contributor

Suggested change
- ~~~
+ ~~~
+ This approach avoids a proliferation of metric names while allowing third-party tools to display each certificate's expiration as a separate line in a unified graph or table.
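
The two uses of static labels discussed in this review (summing across label values, and showing distinct label values side by side) can be illustrated with a small Python sketch. It is a simplified illustration, not CockroachDB or Prometheus client code. The helper names `parse_samples` and `values_by_label` are hypothetical, the parser handles only labeled samples without timestamps or escaped quotes, and the `sql_count` values and `query_type` label values are invented for the example. The certificate lines are copied from the sample output in this thread.

```python
import re
from collections import defaultdict

# Matches one labeled sample line: metric_name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(?P<key>[A-Za-z_][A-Za-z0-9_]*)="(?P<val>[^"]*)"')


def parse_samples(text):
    """Parse labeled sample lines into (name, labels, value) tuples.

    Simplified on purpose: comment lines, blank lines, and unlabeled
    samples are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = {
            m.group("key"): m.group("val")
            for m in LABEL_RE.finditer(match.group("labels"))
        }
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def values_by_label(samples, metric, label):
    """Sum the values of `metric`, grouped by the value of `label`."""
    grouped = defaultdict(float)
    for name, labels, value in samples:
        if name == metric and label in labels:
            grouped[labels[label]] += value
    return dict(grouped)


# Sample lines copied from the output shown in this thread.
CERT_OUTPUT = """\
security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="client-ca"} 0
security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="ui-ca"} 0
security_certificate_expiration{node_id="1",tenant="demoapp",certificate_type="node"} 1.840654953e+09
"""

# Hypothetical values: only the metric name and label name come from the thread.
SQL_OUTPUT = """\
sql_count{node_id="1",query_type="insert"} 10
sql_count{node_id="1",query_type="select"} 25
sql_count{node_id="1",query_type="update"} 3
sql_count{node_id="1",query_type="delete"} 2
"""

# Distinct categories: one value per certificate type, shown side by side.
cert_by_type = values_by_label(
    parse_samples(CERT_OUTPUT), "security_certificate_expiration", "certificate_type"
)

# Related categories: per-type values that can also be summed into one total.
sql_by_type = values_by_label(parse_samples(SQL_OUTPUT), "sql_count", "query_type")
total_sql = sum(sql_by_type.values())
```

The same grouping function serves both cases. What differs is whether summing the groups is meaningful: a total across `query_type` values is a useful number, while a total across `certificate_type` values is not.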
