
OCPBUGS-1803: Remove compliance_operator_compliance_scan_error_total … #223

Merged
merged 1 commit into ComplianceAsCode:master on Mar 22, 2023

Conversation

@rhmdnd commented Feb 17, 2023

…metric

This metric contained the scan error, which can exceed lengths of 2k (sometimes 11k) and causes resource issues with Prometheus and with integrating metrics into different storage backends.

This commit removes the metric since it goes against Prometheus best practices:

https://prometheus.io/docs/practices/naming/#labels
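
For reference, a minimal sketch of the kind of definition at issue (an assumed shape, not the operator's exact code; the metric name, labels, and help text are taken from the before/after output later in this thread): a CounterVec whose `error` label holds a free-form message, so every distinct error string creates a new time series.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Sketch of the problematic definition: the "error" label can carry
// multi-kilobyte oscap output, so cardinality grows with every unique error.
var scanErrorTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "compliance_operator_compliance_scan_error_total",
		Help: "A counter for the total number of encounters of error",
	},
	[]string{"name", "error"},
)

// recordScanError is a hypothetical helper illustrating the call site:
// each unique errMsg value becomes its own series in Prometheus.
func recordScanError(scanName, errMsg string) {
	scanErrorTotal.With(prometheus.Labels{"name": scanName, "error": errMsg}).Inc()
}
```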

@openshift-ci-robot (Collaborator)

@rhmdnd: This pull request references Jira Issue OCPBUGS-1803, which is invalid:

  • expected the bug to target the "4.13.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

…metric

This metric contained the scan error, which can exceed lengths of 2k (sometimes 11k) and causes resource issues with Prometheus and with integrating metrics into different storage backends.

This commit removes the metric since it goes against Prometheus best practices:

https://prometheus.io/docs/practices/naming/#labels

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

CHANGELOG.md: review comment (outdated, resolved)
@Vincent056 commented Feb 17, 2023

Maybe we just need to remove the error message labels from the metric, instead of completely removing the error metric?

https://github.com/ComplianceAsCode/compliance-operator/blob/master/pkg/controller/metrics/metrics.go#L166

Just read Matt's comments, and maybe removing the error metric makes sense here

@mkumku (Collaborator) commented Feb 17, 2023

What exactly does this metric provide? Can you please elaborate a bit more about it?

Before we remove it we should notify the users - ideally set a deprecation note first and then remove it in the next release.
If you want to remove it right now and not follow the process, we need to scope the end-user impact: how many users are using it (unsure we have a way to figure that out), but maybe it is an insignificant parameter and can be removed without impacting the production environment.

If we decide to go the latter route, we can issue an internal KCS for Red Hat customers and send a T3 blog to share with the TAMs/CSMs to spread the knowledge that way and move on.

@JAORMX (Collaborator) left a comment

Uhm... is this really the solution? I'd say this is really an issue with the metric's cardinality, and we should instead remove the error message from the metric's labels. That would reduce the cardinality greatly. IMO, this is a better solution than removing the metric entirely, as it still gives operators a per-scan view of the error rate in their deployments, as opposed to leaving them in the dark.

I'm not even sure if adding the scan name as a metric label is useful; we could probably just have a global error counter.
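
To make the two alternatives concrete, a hedged sketch (illustrative only, not a proposed patch; variable names are hypothetical) of what each option would look like with client_golang:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Option A: keep a per-scan view by labelling only with the scan name;
// cardinality is then bounded by the number of ComplianceScan objects.
var scanErrorByName = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "compliance_operator_compliance_scan_error_total",
		Help: "A counter for the total number of errors for a particular scan",
	},
	[]string{"name"},
)

// Option B: a single global counter with no labels at all.
// (Only one of the two would actually be registered; both are shown for contrast.)
var scanErrorGlobal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "compliance_operator_compliance_scan_error_total",
	Help: "A counter for the total number of scan errors",
})
```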

@simonpasquier left a comment

Prometheus developer here, I agree that keeping the metric but removing the high-cardinality labels is probably better than removing it.

@rhmdnd (Author) commented Feb 22, 2023

Thanks for the feedback - I'll respin this.

@xiaojiey (Collaborator)

/hold for test

@JAORMX (Collaborator) left a comment

This looks better IMO. Thanks!

@openshift-ci openshift-ci bot added the lgtm label Feb 23, 2023
@JAORMX (Collaborator) commented Feb 23, 2023

/retest

@JAORMX (Collaborator) commented Feb 23, 2023

@rhmdnd seems you need to update the tests:

#11 201.2       inconsistent label cardinality: expected 2 label values but got 1 in prometheus.Labels{"name":"test"}
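
For anyone hitting the same failure: "inconsistent label cardinality" comes from client_golang when the labels supplied at a call site do not match the label names the CounterVec was declared with. A hedged, self-contained reproduction (not the operator's test code) is below; the fix is to update the declaration and the call sites together.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// The vec is still declared with two labels...
	scanErrorTotal := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "compliance_operator_compliance_scan_error_total"},
		[]string{"name", "error"},
	)

	// ...but the caller now supplies only "name", so client_golang reports
	// the same "inconsistent label cardinality: expected 2 label values but
	// got 1" error seen in CI.
	_, err := scanErrorTotal.GetMetricWith(prometheus.Labels{"name": "test"})
	fmt.Println(err)

	// Fix: declare the vec with []string{"name"} (and drop "error" everywhere).
}
```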

…rror_total metric

This metric contained the scan error, which can exceed lengths of 2k
(sometimes 11k) and causes resource issues with Prometheus and with
integrating metrics into different storage backends.

This commit removes the error to reduce cardinality of the metric and
follow Prometheus best practices:

  https://prometheus.io/docs/practices/naming/#labels
@rhmdnd (Author) commented Mar 1, 2023

@rhmdnd seems you need to update the tests:

Fixed, and I have a clean run locally. Need to update the metric to actually remove the error label.

@rhmdnd (Author) commented Mar 1, 2023

What exactly does this metric provide? Can you please elaborate a bit more about it?

The metric was providing the scan name and the scan error. The error could be a number of different things, which increases the cardinality of the metric (potentially bloating Prometheus and going against Prometheus best practices).

Before we remove it we should notify the users - ideally set a deprecation note first and then remove it in the next release. If you want to remove it right now and not follow the process, we need to scope the end-user impact: how many users are using it (unsure we have a way to figure that out), but maybe it is an insignificant parameter and can be removed without impacting the production environment.

If we decide to go the latter route, we can issue an internal KCS for Red Hat customers and send a T3 blog to share with the TAMs/CSMs to spread the knowledge that way and move on.

We decided to keep the metric, but just remove the error from the metric labels (reducing cardinality) and make the metric more useful.
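
As a minimal sketch of that decision (assuming client_golang; the actual change in this PR may differ in detail, and the helper below is hypothetical), the error string stays out of the label set and only the scan name is kept:

```go
package metrics

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

var scanErrorTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "compliance_operator_compliance_scan_error_total",
		Help: "A counter for the total number of errors for a particular scan",
	},
	[]string{"name"}, // only the scan name; the error text is no longer a label
)

// handleScanError is a hypothetical call site: the full error message is
// surfaced through logs (and the scan status) rather than Prometheus.
func handleScanError(scanName, errMsg string) {
	log.Printf("scan %q errored: %s", scanName, errMsg)
	scanErrorTotal.WithLabelValues(scanName).Inc()
}
```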

@xiaojiey (Collaborator) commented Mar 7, 2023

Verification passed with 4.13.0-0.nightly-2023-03-07-081835 + the code in the PR:

1. Install the Compliance Operator with the code in the PR.
2. Create a compliance scan that triggers an error result:
$ oc create -f - <<EOF
apiVersion: compliance.openshift.io/v1alpha1
kind: ComplianceScan
metadata:
  name: worker-scan2
spec:
  profile: xccdf_org.ssgproject.content_profile_coreos-ncp
  content: ssg-rhcos4-ds.xml
  contentImage: quay.io/complianceascode/ocp4:latest
  debug: true
  nodeSelector:
      node-role.kubernetes.io/worker: ""
EOF
compliancescan.compliance.openshift.io/worker-scan2 created
$ oc get scan -w
NAME           PHASE       RESULT
worker-scan2   LAUNCHING   NOT-AVAILABLE
worker-scan2   LAUNCHING   NOT-AVAILABLE
worker-scan2   RUNNING     NOT-AVAILABLE
worker-scan2   AGGREGATING   NOT-AVAILABLE
worker-scan2   DONE          ERROR
3. Check the metrics:
####### The result after the PR is applied:
$ oc run --rm -i --restart=Never --image=registry.fedoraproject.org/fedora-minimal:latest -n openshift-compliance test-metrics -- bash -c 'curl -ks -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://metrics.openshift-compliance.svc:8585/metrics-co' | grep compliance
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "test-metrics" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "test-metrics" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "test-metrics" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "test-metrics" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
# HELP compliance_operator_compliance_scan_error_total A counter for the total number of errors for a particular scan
# TYPE compliance_operator_compliance_scan_error_total counter
compliance_operator_compliance_scan_error_total{name="worker-scan2"} 1
# HELP compliance_operator_compliance_scan_status_total A counter for the total number of updates to the status of a ComplianceScan
# TYPE compliance_operator_compliance_scan_status_total counter
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="AGGREGATING",result="NOT-AVAILABLE"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="DONE",result="ERROR"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="LAUNCHING",result="NOT-AVAILABLE"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="PENDING",result=""} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="RUNNING",result="NOT-AVAILABLE"} 1
######### The result before the PR is applied:
$ oc run --rm -i --restart=Never --image=registry.fedoraproject.org/fedora-minimal:latest -n openshift-compliance test-metrics -- bash -c 'curl -ks -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://metrics.openshift-compliance.svc:8585/metrics-co' | grep compliance
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "test-metrics" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "test-metrics" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "test-metrics" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "test-metrics" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
# HELP compliance_operator_compliance_scan_error_total A counter for the total number of encounters of error
# TYPE compliance_operator_compliance_scan_error_total counter
compliance_operator_compliance_scan_error_total{error="I: oscap: Identified document type: data-stream-collection\nI: oscap: Created a new XCCDF session from a SCAP Source Datastream '/content/ssg-rhcos4-ds.xml'.\nI: oscap: Validating XML signature.\nI: oscap: Signature node not found\nI: oscap: Identified document type: Benchmark\nI: oscap: Identified document type: cpe-list\nI: oscap: Started new OVAL agent ssg-rhcos4-oval.xml.\nI: oscap: Querying system information.\nI: oscap: Starting probe on URI 'queue://system_info'.\nI: oscap: Switching probe to PROBE_OFFLINE_OWN mode.\nI: oscap: I will run system_info_probe_main:\nNo profile matching suffix \"xccdf_org.ssgproject.content_profile_coreos-ncp\" was found. Get available profiles using:\n$ oscap info \"/content/ssg-rhcos4-ds.xml\"\n",name="worker-scan2"} 1
# HELP compliance_operator_compliance_scan_status_total A counter for the total number of updates to the status of a ComplianceScan
# TYPE compliance_operator_compliance_scan_status_total counter
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="AGGREGATING",result="NOT-AVAILABLE"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="DONE",result="ERROR"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="LAUNCHING",result="NOT-AVAILABLE"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="PENDING",result=""} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="RUNNING",result="NOT-AVAILABLE"} 1

@xiaojiey (Collaborator) commented Mar 7, 2023

/label qe-approved

@xiaojiey (Collaborator) commented Mar 7, 2023

/jira refresh

@openshift-ci-robot (Collaborator)

@xiaojiey: This pull request references Jira Issue OCPBUGS-1803, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.13.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xiaojiey (Collaborator) commented Mar 7, 2023

/jira refresh

@openshift-ci-robot (Collaborator)

@xiaojiey: This pull request references Jira Issue OCPBUGS-1803, which is invalid:

  • expected the bug to target only the "4.14.0" version, but multiple target versions were set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xiaojiey (Collaborator) commented Mar 7, 2023

/jira refresh

@openshift-ci-robot (Collaborator)

@xiaojiey: This pull request references Jira Issue OCPBUGS-1803, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @xiaojiey

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rhmdnd (Author) commented Mar 13, 2023

I think we have an issue in some of our cleanup code. I've seen the failure pop up a couple of times, but none of the tests fail directly.

Opened #258 to track a fix.

@rhmdnd (Author) commented Mar 13, 2023

/retest

@rhmdnd (Author) commented Mar 17, 2023

@jhrozek @Vincent056 should be ready for another review from dev.

@jhrozek left a comment

/lgtm

@openshift-ci bot commented Mar 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JAORMX, jhrozek, rhmdnd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rhmdnd (Author) commented Mar 21, 2023

Removing the hold label since this was verified.

@rhmdnd (Author) commented Mar 22, 2023

/retest-required

DNS issues in CI should be resolved now.

@openshift-merge-robot openshift-merge-robot merged commit 478a2ad into ComplianceAsCode:master Mar 22, 2023
@openshift-ci-robot (Collaborator)

@rhmdnd: Jira Issue OCPBUGS-1803: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-1803 has been moved to the MODIFIED state.

In response to this:

…metric

This metric contained the scan error, which can exceed lengths of 2k (sometimes 11k) and causes resource issues with Prometheus and with integrating metrics into different storage backends.

This commit removes the metric since it goes against Prometheus best practices:

https://prometheus.io/docs/practices/naming/#labels

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
