KEP-4785: Resource State Metrics #4811

rexagod · 2024-08-27T21:06:27Z

One-line PR description: This KEP proposes the incorporation of the CRDMetrics controller into the Kubernetes organization, similar to its existing counterpart for native metrics, Kube State Metrics.

Issue link: Resource State Metrics #4785

Other comments: Please refer to https://github.com/rexagod/resource-state-metrics for more details.

keps/prod-readiness/sig-instrumentation/4785.yaml

rexagod · 2024-09-03T16:47:10Z

keps/sig-instrumentation/4785-crdmetrics-controller/README.md

+Resource State API from Kube State Metrics, and replace it by the CRDMetrics 
+controller which, in addition to its own benefits, would allow Kube State 
+Metrics to drop all Custom Resource State API-specific behaviors that can crash
+Kube State Metrics, directly affecting the availability of native metrics 


kubernetes/kube-state-metrics#2354

kubernetes/kube-state-metrics#2478

keps/sig-instrumentation/4785-crdmetrics-controller/kep.yaml

This KEP proposes the incorporation of the CRDMetrics controller into the Kubernetes organization, similar to its existing counterpart for native metrics, Kube State Metrics. Refer: https://github.com/rexagod/crdmetrics

chrischdi

I'm happy to see this :-)

I will follow along and try to engage and help where possible for me!

keps/sig-instrumentation/4785-crdmetrics-controller/README.md

sftim

Thanks for the PR.

I'm afraid I do have questions.

keps/sig-instrumentation/4785-crdmetrics-controller/README.md

keps/sig-instrumentation/4785-crdmetrics-controller/kep.yaml

keps/sig-instrumentation/4785-crdmetrics-controller/README.md

keps/sig-instrumentation/4785-crdmetrics-controller/kep.yaml

mrueg

Thanks for your work on this @rexagod ! I added a few comments.

keps/sig-instrumentation/4785-crdmetrics-controller/README.md

sftim · 2024-11-08T10:32:44Z

Is CustomResourceMetricsConfig a better name?

SGTM

keps/prod-readiness/sig-instrumentation/4785.yaml

keps/sig-instrumentation/4785-resource-state-metrics/README.md

…85: CRDMetrics Controller

rexagod · 2025-01-21T07:49:25Z

Thank you for the thorough review, @wojtek-t, the out-of-tree nature of this KEP had me a bit confused. I believe this is good for another round of reviews now 🙇🏼

rexagod · 2025-02-05T19:26:17Z

cc @wojtek-t for another review here please.

rexagod · 2025-03-04T17:34:40Z

Friendly ping, @wojtek-t for another look here please.

rexagod · 2025-03-10T17:15:56Z

Bump.

cc @kubernetes/production-readiness if folks have spare cycles to help get this reviewed and merged.

mrueg · 2025-03-10T17:55:12Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

@@ -0,0 +1,1216 @@
+<!--
+**Note:** When your KEP is complete, all of these comment blocks should be removed.


^ are we good to remove all comment blocks? That will make it easier to review I guess? @rexagod

I believe "complete" here means stable graduation, but I could be wrong.

chrischdi · 2025-03-21T15:25:58Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

+At the moment, the `spec` houses a single `configuration` field, which defines
+the metric generation configuration as follows (please note that the schema is
+fast-moving at this point and may be subject to change based on the [feedback 
+obtained](https://github.com/rexagod/resource-state-metrics/issues?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen)):


Will it be considired to be changed in future and have the configuration be a typed api?

WIth having a .spec.configuration field which is a string, we could also just use a configmap or secret instead 🤔

Also to build tooling on top (e.g. to generate a configuration for a CRD from markers) it would be helpful to have this an typed API.

I like the idea of taking a config parameter reference - eg

configurationReference: apiVersion: v1 # for alpha, has to be ConfigMap kind: ConfigMap # for alpha, has to be ConfigMap namespace: example # optional, but must be set for initial / alpha implementation name: rsm-config itemPath: >- data.config

Absolutely, I plan to make the configuration structured in the next CRD version, but a reference to a ConfigMap is definitely doable before that.

chrischdi · 2025-03-21T17:08:46Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

+        help: "Information about a MyPlatform instance" # The help text for the
+                                                        # metric family, plugged
+                                                        # in as-is.
+        metrics: # Set of metrics to generate under the current metrics family.


can we have an example on how this could look like on arrays inside a resurce?

E.g. metrics for .status.conditions :-)

(let me know if this is too early, as this is technical implementation question)

I've got one in the PoC repository:

labels as input: https://github.com/rexagod/resource-state-metrics/blob/main/manifests/custom-resource.yaml#L111

Output: https://github.com/rexagod/resource-state-metrics/blob/main/tests/conformance_test.go#L33

On a first view that one looks different.

.status.conditions normally is an array, not a map.

Example (simplified):

... status: conditions: - status: "False" type: Ready - status: "True" type: Foo

The target metrics would be:

metric_name{...,type="Ready",status="True"} 0 metric_name{...,type="Ready",status="False"} 1 metric_name{...,type="Ready",status="Unknown"} 0 metric_name{...,type="Foo",status="True"} 1 metric_name{...,type="Foo",status="False"} 0 metric_name{...,type="Foo",status="Unknown"} 0

However, I want to give it another try :-)

Weird, I don't know why I assumed this to be about maps instead of lists. Currently, for the example above, the resulting metrics would look like (assuming .status.condition was the specified expression):

metric_name{...,type="Ready",status="False"} 1 metric_name{...,type="Foo",status="True"} 1

RSM only exposes gauge metrics, as opposed to stateset, counter and info that CRS encompassed, which also proved to be a pain point for its maintenance, as all three of these can be expressed using gauges. RSM takes a different approach in which the nature of metrics are dependant on the constructs supported by the resolver, for e.g.,

value: o.status.phase == 'Running' ? 1 : 0

may be used to achieve a similar effect. In doing so, the intention here is to not expose behaviors (fields) to generate metrics in RSM's configuration itself beyond a point, but instead make them a characteristic of the resolver as much as possible, to avoid the same pitfalls as CRS in the future. To that effect, while the aforementioned behavior can be achieved in CEL, it won't be possible with the unstructured resolver (see kubernetes/kubernetes#128782).

Also, bit off-topic but I've also added recursive parsing for primitive literals as well as composite data-type which also enables generating multiple metric samples from a single metric configuration, something we didn't achieve in CRS.

(Sorry for circling back very late on this. Currently don't have bandwidth...)

That's fine.

I wonder if we then would have to create a metric configuration for each value of phase.
value: o.status.phase == 'Running' ? 1 : 0 looks like we have to hardcode for each phase which exists. Does this then work to still have the same metric name for all?

simonpasquier · 2025-04-25T12:57:26Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

+-->
+
+The controller offers a number of improvements over Kube State Metrics' Custom
+Resource State API, while maintaining a 3x faster round trip time for metric


not sure to understand the 3x faster bit. Do you mind expanding?

Added a link to https://github.com/rexagod/resource-state-metrics/blob/main/tests/bench/bench.sh.

This was used to get a rough estimate around build times and RTT.

simonpasquier · 2025-04-25T13:10:04Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

+                            # NO RESOLVER DEFAULTS TO THE "UNSTRUCTURED" ONE.
+            # Set of label-sets to generate for the current metric.
+            labelKeys: # Set of ordered label-keys, static in nature.
+              - "name"


I suppose that for the vast majority of cases, you'd want a name label (as well as a namespacelabel for namespaced custom resources). Should it be enabled by default?

+1 towards the idea of having a way to specify default labels. I don't know if it should necessarily be enabled by default and fixed to name and namespace only. Maybe we add a new field at the same level as the g,v,k,r in the config called defaultLabels and allow users to specify the one they want to be applied to all the metrics that will follow. The result would be kind of similar to what we have today in ksm: https://github.com/kubernetes/kube-state-metrics/blob/0b01e3abce1da521b5e620b8aaa76774bb0fda87/internal/store/pod.go#L39

Default labels may be defined at any scope, like so.

At the moment, we only inject the GVK labels into metrics OOTB, to adhere to CRS' current behavior, but giving this some more thought, I think we could either:

stop exposing the GVK labels OOTB to allow users to create scalar metrics as well, or,

inject name and namespace, as that's common for all CRs, however, this is also the case with other metadata fields (albeit, being less used comparatively), for e.g., uid, creationTimestamp, resourceVersion, etc. which may give rise to users requesting their addition as well.

wojtek-t · 2025-04-25T13:54:39Z

Friendly ping, @wojtek-t for another look here please.

Sorry for huge delay - this looks good enough for me, especially for the external component.

/approve PRR

k8s-ci-robot · 2025-04-25T13:54:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg, rexagod, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [wojtek-t]
~~keps/sig-instrumentation/OWNERS~~ [rexagod]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

keps/sig-instrumentation/4785-resource-state-metrics/README.md

dgrisonnet · 2025-04-28T12:42:55Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

+supported DSLs to be able to translate their configuration into a format that
+the controller can understand, missing which could result in invalid metrics.
+
+Also, since this aims to deprecate the Custom Resource State API, a timeline 


is there a migration plan?

I'm hoping to have a well-documented configuration mapping for migration once we are closer to GA and the API changes slow down a bit more.

dgrisonnet · 2025-04-28T12:50:46Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

+- Once fetched, the controller `unmarshal`s the configuration YAML directly into
+`stores` which are a set of metric `families`, which in turn are a set of 
+`metrics`.
+- Metric `stores` are created based on its respective GVKR (a type that embeds 
+`schema.GroupVersionKind`, `schema.GroupVersionResource` to avoid 
+[plural ambiguities]), and reflectors for the specified resource are
+initialized, and populate the stores on its update.
+- `/metrics` pings on `RSM_MAIN_PORT` trigger the server to write the
+raw metrics, combined with its appropriate header(s), in the response. All
+generated metrics are hardcoded to `gauge`s by design, as Prometheus lacks


There are a lot of things in common with ksm, is rsm already re-using some of ksm's code or should we add new interfaces to avoid deduplicating code between the projects?

Maybe we can build parts of ksm as a library, so RSM can reuse those bits?

Yeah that's what I was thinking, but I don't know how much the low level RSM's code differs from KSM

Wherever possible, constructs surfaced to the users are kept conventionally similar to KSM's, to ease migration.

However, something I touched on the KubeCon presentation was having a fail-safe, similar to node_exporter's collector extension pattern, that would allow downstream forks to generate metrics if they do not, or cannot, do so using the supported resolvers. To accomplish this, KSM will need to officially support it's packages, which is something we didn't explicitly do in the past. Currently, /external metrics rely on a 2019 commit for library support.

Do note that we don't strictly need to do this for RSM to graduate, or even support a stable experience to define collectors like so (case in point, openshift-state-metrics that's been and still is in production, which relies on the same commit used in RSM), but it'll be nice to have; and maybe even easier to achieve once KSM is decoupled from CRS.

keps/sig-instrumentation/4785-resource-state-metrics/README.md

dgrisonnet · 2025-04-28T13:00:37Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

+obtained](https://github.com/rexagod/resource-state-metrics/issues?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen)):
+
+```yaml
+stores: # Set of metrics stores for each CR we want to generate metrics for.


store is an internal term that is not necessarily easy to understand for end-users. How about calling it resources or metrics?

metrics being a parent of families (which in turn has metric) sounds a bit odd.

resources conflicts with the resource field internally from a DX perspective.

Would it suffice to document the stores field once we have structured configuration in the CR? Or I could expand the configuration documentation to include information for all such fields as well.

┌[rexagod@thegoldenhind] [/dev/pts/3] [main ⚡] └[~/repositories/oss/resource-state-metrics/github.com/rexagod/resource-state-metrics]> k explain resourcemetricsmonitor.spec GROUP: resource-state-metrics.instrumentation.k8s-sigs.io KIND: ResourceMetricsMonitor VERSION: v1alpha1 FIELD: spec <Object> DESCRIPTION: ResourceMetricsMonitorSpec is the spec for a ResourceMetricsMonitor resource. FIELDS: configuration <string> -required- Configuration is the RSM configuration that generates metrics.

Would it suffice to document the stores field once we have structured configuration in the CR?

I am fine with that.

Otherwise an option could be to call it generators? Maybe it is more explicit to users

Sure, generators makes more sense. I'll rename it.

dgrisonnet · 2025-04-28T13:04:08Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

+            labelKeys: # Set of ordered label-keys, static in nature.
+              - "name"
+              - "static_foo"
+            labelValues: # Set of ordered label values, dynamic in nature.
+                         # Therefore, these may contain static values or
+                         # parse-able expressions.
+              - "o.metadata.name" # Parse-able CEL expression.
+              - "static_foo_value" # Static value.


I feel like having two distinct arrays for label keys and values could be prone to errors. Could we perhaps have them in a map instead? Like:

- labels: name: "o.metadata.name"

I see. There's places in the codebase where logic surrounding labels requires ordering, so we'd have to convert labels defined at either of the three scopes (stores, families, and metrics) to list first.

Preserving original order can also help us pass results based on the query to the value field, which can then work based on that contextual data. We've seen requests like this for CRS which demand a shared context between labels and value, and an ordering should help users specify that in the value field.

keps/sig-instrumentation/4785-resource-state-metrics/README.md

dgrisonnet · 2025-04-28T13:25:01Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

+-->
+
+The controller abides by the following principals:
+* Garbage in, garbage out: Invalid configurations will generate invalid metrics.


Do you have an example of an invalid configuration that would produce invalid metrics? As far as I can tell since rsm is aware of the custom resource schemas, most if not all of the user inputs could be validated and instead of silencing configuration errors, rsm could surface them to the users

Configuration errors are surfaced in the logs, as well as the managed resource's status field. I'll add a metric to the self port to surface that there as well.

My intention here was to mainly point out that, for instance, a label key that doesn't follow Prometheus' [a-zA-Z_][a-zA-Z0-9_]* format will not raise an error, and still be present in the generated metric, such as {1="1"}, which will be detected at ingestion. However, we still run some validations that make sense on the RSM-level, such as having the same number of label keys and values, which will be surfaced.

dgrisonnet · 2025-04-28T13:26:26Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

+SIG to get the process for these resources started right away.
+-->
+
+We request a repository (`kubernetes/resource-state-metrics`) to migrate


Since it is a subproject it should land in kubernetes-sigs. kube-state-metrics is in the kubernetes org only because the project predates the sigs org and it would be quite a bit of work ot move to the new org

dgrisonnet · 2025-04-28T13:27:50Z

keps/sig-instrumentation/4785-resource-state-metrics/README.md

+We considered refactoring the Kube State Metrics' Custom Resource State API, but
+that has actually been done multiple times in the past which often amounts to
+us ending up in the same position, owing to its limited scalability.
+


it could be worth also mentioning kubernetes/kube-state-metrics#1978 as an alternative

… KEP-4785: CRDMetrics Controller

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

dgrisonnet · 2025-06-17T13:25:16Z

/lgtm

k8s-ci-robot added cncf-cla: yes kind/kep sig/instrumentation labels Aug 27, 2024

k8s-ci-robot requested review from dashpole and dgrisonnet August 27, 2024 21:06

k8s-ci-robot added approved size/XXL labels Aug 27, 2024

rexagod mentioned this pull request Aug 27, 2024

Resource State Metrics #4785

Open

4 tasks

rexagod force-pushed the 4785 branch from ebd7331 to 378ba8d Compare August 27, 2024 21:13

k8s-ci-robot removed the approved label Aug 27, 2024

rexagod force-pushed the 4785 branch from 378ba8d to 3c95d6f Compare August 27, 2024 21:32

kannon92 reviewed Aug 29, 2024

View reviewed changes

keps/prod-readiness/sig-instrumentation/4785.yaml Outdated Show resolved Hide resolved

logicalhan self-assigned this Aug 29, 2024

logicalhan added the lead-opted-in label Aug 29, 2024

rexagod commented Sep 3, 2024

View reviewed changes

dgrisonnet reviewed Sep 12, 2024

View reviewed changes

keps/sig-instrumentation/4785-crdmetrics-controller/kep.yaml Outdated Show resolved Hide resolved

rexagod added 2 commits September 12, 2024 22:50

KEP-4785: CRDMetrics Controller

964f493

This KEP proposes the incorporation of the CRDMetrics controller into the Kubernetes organization, similar to its existing counterpart for native metrics, Kube State Metrics. Refer: https://github.com/rexagod/crdmetrics

fixup! KEP-4785: CRDMetrics Controller

72d03ff

rexagod force-pushed the 4785 branch from 3c95d6f to 72d03ff Compare September 12, 2024 17:23

chrischdi reviewed Sep 20, 2024

View reviewed changes

keps/sig-instrumentation/4785-crdmetrics-controller/README.md Outdated Show resolved Hide resolved

sftim reviewed Oct 19, 2024

View reviewed changes

rexagod mentioned this pull request Oct 21, 2024

Metrics Anomaly Detector #4231

Closed

4 tasks

rexagod added 2 commits November 8, 2024 14:34

fixup! fixup! KEP-4785: CRDMetrics Controller

eaddfdf

fixup! fixup! fixup! KEP-4785: CRDMetrics Controller

c46ed31

rexagod requested review from sftim, dgrisonnet and mrueg November 8, 2024 09:15

fixup! fixup! fixup! fixup! KEP-4785: CRDMetrics Controller

d93c6c3

mrueg reviewed Nov 8, 2024

View reviewed changes

wojtek-t reviewed Jan 8, 2025

View reviewed changes

wojtek-t self-assigned this Jan 8, 2025

fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! KEP-47…

a1a9a7d

…85: CRDMetrics Controller

k8s-ci-robot removed the lgtm label Jan 20, 2025

rexagod requested a review from wojtek-t January 27, 2025 16:53

dashpole removed the lead-opted-in label Feb 13, 2025

mrueg reviewed Mar 10, 2025

View reviewed changes

chrischdi reviewed Mar 21, 2025

View reviewed changes

simonpasquier reviewed Apr 25, 2025

View reviewed changes

k8s-ci-robot added the approved label Apr 25, 2025

dgrisonnet reviewed Apr 28, 2025

View reviewed changes

fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup!…

5b4af61

… KEP-4785: CRDMetrics Controller

rexagod force-pushed the 4785 branch from d617134 to 5b4af61 Compare May 17, 2025 20:26

rexagod requested a review from dgrisonnet May 25, 2025 00:16

review: s/stores/generators

c1a5faa

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

k8s-ci-robot assigned dgrisonnet Jun 17, 2025

k8s-ci-robot added the lgtm label Jun 17, 2025

k8s-ci-robot merged commit 9f63a07 into kubernetes:master Jun 17, 2025
4 checks passed

k8s-ci-robot added this to the v1.34 milestone Jun 17, 2025

		@@ -0,0 +1,1216 @@
		<!--
		Note: When your KEP is complete, all of these comment blocks should be removed.

KEP-4785: Resource State Metrics #4811

KEP-4785: Resource State Metrics #4811

Conversation

rexagod commented Aug 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

chrischdi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sftim left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mrueg left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sftim commented Nov 8, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rexagod commented Jan 21, 2025

Uh oh!

rexagod commented Feb 5, 2025

Uh oh!

rexagod commented Mar 4, 2025

Uh oh!

rexagod commented Mar 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rexagod Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chrischdi Mar 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rexagod May 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rexagod commented Aug 27, 2024 •

edited

Loading

rexagod commented Mar 10, 2025 •

edited

Loading

rexagod Apr 15, 2025 •

edited

Loading

chrischdi Mar 21, 2025 •

edited

Loading

rexagod May 17, 2025 •

edited

Loading

rexagod May 17, 2025 •

edited

Loading

rexagod May 17, 2025 •

edited

Loading

rexagod May 17, 2025 •

edited

Loading