Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fix the issue that duplicated CR metrics are generated when CRD is deleted/applied multiple times #2257

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

CatherineF-dev
Copy link
Contributor

@CatherineF-dev CatherineF-dev commented Dec 6, 2023

What this PR does / why we need it: #2223

# It returns same metrics data many times when CRD is deleted/applied multiple times
 curl localhost:8089/metrics
# HELP cr_creationtimestamp
# TYPE cr_creationtimestamp gauge
cr_creationtimestamp{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 1.701828773e+09
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
cr_resourceversion{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 909919
# HELP cr_creationtimestamp
# TYPE cr_creationtimestamp gauge
cr_creationtimestamp{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 1.701828773e+09
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
cr_resourceversion{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 909919

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Mitigation for #2223

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 6, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: CatherineF-dev
Once this PR has been reviewed and has the lgtm label, please assign mrueg for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 6, 2023
@CatherineF-dev CatherineF-dev changed the title Remove obsolete CR metrics Fix obsolete CR metrics Dec 6, 2023
@CatherineF-dev CatherineF-dev changed the title Fix obsolete CR metrics fix: Fix obsolete CR metrics Dec 6, 2023
@CatherineF-dev CatherineF-dev force-pushed the obsolete-cr-metrics branch 2 times, most recently from ea62e16 to 1685de5 Compare December 6, 2023 01:49
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 6, 2023
@CatherineF-dev CatherineF-dev changed the title fix: Fix obsolete CR metrics fix: [WIP] Fix obsolete CR metrics Dec 6, 2023
@CatherineF-dev CatherineF-dev changed the title fix: [WIP] Fix obsolete CR metrics fix: duplicated CR metrics when CRD is deleted/applied multiple times Dec 6, 2023
@CatherineF-dev CatherineF-dev changed the title fix: duplicated CR metrics when CRD is deleted/applied multiple times fix: fix duplicated CR metrics when CRD is deleted/applied multiple times Dec 6, 2023
@@ -119,9 +119,22 @@ func (b *Builder) WithEnabledResources(r []string) error {
sort.Strings(sortedResources)

b.enabledResources = append(b.enabledResources, sortedResources...)
b.enabledResources = unique(b.enabledResources)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have a map[string] bool here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using unique is fine as a mitigation.

It didn't remove deleted CR https://github.com/rexagod/kube-state-metrics/blob/25a1d8da057cf761d614c59a52785335d34082d1/internal/discovery/discovery.go#L232. After this is fixed, we can remove unique.

@dgrisonnet
Copy link
Member

I don't think this will fix the inherent issue. It seems that the mechanism detecting the deletion of CRD matched by the ksm config is not working properly. Normally when that happens, the internal store should be deleted alongside its informer because we don't want to expose metrics about this resource anymore.

@dgrisonnet
Copy link
Member

dgrisonnet commented Dec 6, 2023

To be clearer, the scenario where a specific resource appears twice in ksm code should never happen, so we should understand why it happened and fix it instead of enforcing uniqueness.

@CatherineF-dev
Copy link
Contributor Author

got it, agree. Will debug again to find root cause.

@CatherineF-dev
Copy link
Contributor Author

It should be related to #1851 where adding PollForCacheUpdates function.

@CatherineF-dev
Copy link
Contributor Author

CatherineF-dev commented Dec 13, 2023

It deletes removed CR here https://github.com/rexagod/kube-state-metrics/blob/main/internal/discovery/discovery.go#L42C24-L116, while didn't delete removed CR here https://github.com/rexagod/kube-state-metrics/blob/25a1d8da057cf761d614c59a52785335d34082d1/internal/discovery/discovery.go#L232.

How about creating a new issue and merging this PR as a mitigation?
I think it needs some changes around #1851.

Or we only want the fix PR instead of mitigation.

cc @dgrisonnet

@CatherineF-dev CatherineF-dev changed the title fix: fix duplicated CR metrics when CRD is deleted/applied multiple times fix: mitigate the issue that duplicated CR metrics when CRD is deleted/applied multiple times Dec 13, 2023
@CatherineF-dev CatherineF-dev changed the title fix: mitigate the issue that duplicated CR metrics when CRD is deleted/applied multiple times fix: mitigate issue that duplicated CR metrics are generated when CRD is deleted/applied multiple times Dec 13, 2023
@CatherineF-dev CatherineF-dev changed the title fix: mitigate issue that duplicated CR metrics are generated when CRD is deleted/applied multiple times fix: mitigate the issue that duplicated CR metrics are generated when CRD is deleted/applied multiple times Dec 13, 2023
@CatherineF-dev CatherineF-dev changed the title fix: mitigate the issue that duplicated CR metrics are generated when CRD is deleted/applied multiple times fix: fix the issue that duplicated CR metrics are generated when CRD is deleted/applied multiple times Dec 14, 2023
Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/triage accepted
/assign @dgrisonnet

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 14, 2023
@rexagod
Copy link
Member

rexagod commented Jan 11, 2024

I second #2257 (comment), and as for #2257 (comment), I'd prefer the changes in this PR (instead of mitigating the issue, we want to identify the root cause and fix that since duplicacy shouldn't ideally be exhibited at all).

@korjek
Copy link

korjek commented Mar 5, 2024

any updates here?
It leads to losing the metrics, which is not good.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants