
OCPBUGS-1803: Remove compliance_operator_compliance_scan_error_total … #223

Merged
merged 1 commit into ComplianceAsCode:master on Mar 22, 2023

Conversation

@rhmdnd commented Feb 17, 2023

…metric

This metric contained the scan error, which can exceed lengths of 2k (sometimes 11k) and causes resource issues with Prometheus and with integrating metrics into different storage backends.

This commit removes the metric since it goes against Prometheus best practices:

https://prometheus.io/docs/practices/naming/#labels
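
For reference, a minimal sketch of the kind of definition at issue (an assumed shape, not the operator's exact code; the metric name, labels, and help text are taken from the before/after output later in this thread): a CounterVec whose `error` label holds a free-form message, so every distinct error string creates a new time series.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Sketch of the problematic definition: the "error" label can carry
// multi-kilobyte oscap output, so cardinality grows with every unique error.
var scanErrorTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "compliance_operator_compliance_scan_error_total",
		Help: "A counter for the total number of encounters of error",
	},
	[]string{"name", "error"},
)

// recordScanError is a hypothetical helper illustrating the call site:
// each unique errMsg value becomes its own series in Prometheus.
func recordScanError(scanName, errMsg string) {
	scanErrorTotal.With(prometheus.Labels{"name": scanName, "error": errMsg}).Inc()
}
```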

@openshift-ci-robot (Collaborator)

@rhmdnd: This pull request references Jira Issue OCPBUGS-1803, which is invalid:

  • expected the bug to target the "4.13.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

…metric

This metric contained the scan error, which can exceed lengths of 2k (sometimes 11k) and causes resource issues with Prometheus and with integrating metrics into different storage backends.

This commit removes the metric since it goes against Prometheus best practices:

https://prometheus.io/docs/practices/naming/#labels

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

CHANGELOG.md: review comment (outdated, resolved)
@Vincent056 commented Feb 17, 2023

Maybe we just need to remove the error message labels from the metric, instead of completely removing the error metric?

https://github.com/ComplianceAsCode/compliance-operator/blob/master/pkg/controller/metrics/metrics.go#L166

Just read Matt's comments, and maybe removing the error metric makes sense here

@mkumku (Collaborator) commented Feb 17, 2023

What exactly does this metric provide? Can you please elaborate a bit more about it?

Before we remove it we should notify the users - ideally set a deprecation note first and then remove it in the next release.
If you want to remove it right now and not follow the process, we need to scope the end-user impact: how many users are using it (unsure we have a way to figure that out), but maybe it is an insignificant parameter and can be removed without impacting the production environment.

If we decide to go the latter route, we can issue an internal KCS for Red Hat customers and send a T3 blog to share with the TAMs/CSMs to spread the knowledge that way and move on.

@JAORMX (Collaborator) left a comment

Uhm... is this really the solution? I'd say this is really an issue with the metric's cardinality, and we should instead remove the error message from the metric's labels. That would reduce the cardinality greatly. IMO, this is a better solution than removing the metric entirely, as it still gives operators a per-scan view of the error rate in their deployments, as opposed to leaving them in the dark.

I'm not even sure if adding the scan name as a metric label is useful; we could probably just have a global error counter.
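
To make the two alternatives concrete, a hedged sketch (illustrative only, not a proposed patch; variable names are hypothetical) of what each option would look like with client_golang:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Option A: keep a per-scan view by labelling only with the scan name;
// cardinality is then bounded by the number of ComplianceScan objects.
var scanErrorByName = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "compliance_operator_compliance_scan_error_total",
		Help: "A counter for the total number of errors for a particular scan",
	},
	[]string{"name"},
)

// Option B: a single global counter with no labels at all.
// (Only one of the two would actually be registered; both are shown for contrast.)
var scanErrorGlobal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "compliance_operator_compliance_scan_error_total",
	Help: "A counter for the total number of scan errors",
})
```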

@simonpasquier left a comment

Prometheus developer here, I agree that keeping the metric but removing the high-cardinality labels is probably better than removing it.

@rhmdnd (Author) commented Feb 22, 2023

Thanks for the feedback - I'll respin this.

@xiaojiey (Collaborator)

/hold for test

@JAORMX (Collaborator) left a comment

This looks better IMO. Thanks!

@openshift-ci openshift-ci bot added the lgtm label Feb 23, 2023
@JAORMX (Collaborator) commented Feb 23, 2023

/retest

@JAORMX (Collaborator) commented Feb 23, 2023

@rhmdnd seems you need to update the tests:

#11 201.2       inconsistent label cardinality: expected 2 label values but got 1 in prometheus.Labels{"name":"test"}
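
For anyone hitting the same failure: "inconsistent label cardinality" comes from client_golang when the labels supplied at a call site do not match the label names the CounterVec was declared with. A hedged, self-contained reproduction (not the operator's test code) is below; the fix is to update the declaration and the call sites together.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// The vec is still declared with two labels...
	scanErrorTotal := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "compliance_operator_compliance_scan_error_total"},
		[]string{"name", "error"},
	)

	// ...but the caller now supplies only "name", so client_golang reports
	// the same "inconsistent label cardinality: expected 2 label values but
	// got 1" error seen in CI.
	_, err := scanErrorTotal.GetMetricWith(prometheus.Labels{"name": "test"})
	fmt.Println(err)

	// Fix: declare the vec with []string{"name"} (and drop "error" everywhere).
}
```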

…rror_total metric

This metric contained the scan error, which can exceed lengths of 2k
(sometimes 11k) and causes resource issues with Prometheus and with
integrating metrics into different storage backends.

This commit removes the error to reduce cardinality of the metric and
follow Prometheus best practices:

  https://prometheus.io/docs/practices/naming/#labels
@rhmdnd (Author) commented Mar 1, 2023

@rhmdnd seems you need to update the tests:

Fixed, and I have a clean run locally. Need to update the metric to actually remove the error label.

@rhmdnd (Author) commented Mar 1, 2023

What exactly does this metric provide? Can you please elaborate a bit more about it?

The metric was providing the scan name and the scan error. The error could be a number of different things, which increases the cardinality of the metric (potentially bloating Prometheus and going against Prometheus best practices).

Before we remove it we should notify the users - ideally set a deprecation note first and then remove it in the next release. If you want to remove it right now and not follow the process, we need to scope the end-user impact: how many users are using it (unsure we have a way to figure that out), but maybe it is an insignificant parameter and can be removed without impacting the production environment.

If we decide to go the latter route, we can issue an internal KCS for Red Hat customers and send a T3 blog to share with the TAMs/CSMs to spread the knowledge that way and move on.

We decided to keep the metric, but just remove the error from the metric labels (reducing cardinality) and make the metric more useful.
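
As a minimal sketch of that decision (assuming client_golang; the actual change in this PR may differ in detail, and the helper below is hypothetical), the error string stays out of the label set and only the scan name is kept:

```go
package metrics

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

var scanErrorTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "compliance_operator_compliance_scan_error_total",
		Help: "A counter for the total number of errors for a particular scan",
	},
	[]string{"name"}, // only the scan name; the error text is no longer a label
)

// handleScanError is a hypothetical call site: the full error message is
// surfaced through logs (and the scan status) rather than Prometheus.
func handleScanError(scanName, errMsg string) {
	log.Printf("scan %q errored: %s", scanName, errMsg)
	scanErrorTotal.WithLabelValues(scanName).Inc()
}
```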

@xiaojiey (Collaborator) commented Mar 7, 2023

Verification passed with 4.13.0-0.nightly-2023-03-07-081835 + the code in the PR:

1. Install the Compliance Operator with the code in the PR.
2. Create a compliance scan that triggers an error result:
$ oc create -f - <<EOF
apiVersion: compliance.openshift.io/v1alpha1
kind: ComplianceScan
metadata:
  name: worker-scan2
spec:
  profile: xccdf_org.ssgproject.content_profile_coreos-ncp
  content: ssg-rhcos4-ds.xml
  contentImage: quay.io/complianceascode/ocp4:latest
  debug: true
  nodeSelector:
      node-role.kubernetes.io/worker: ""
EOF
compliancescan.compliance.openshift.io/worker-scan2 created
$ oc get scan -w
NAME           PHASE       RESULT
worker-scan2   LAUNCHING   NOT-AVAILABLE
worker-scan2   LAUNCHING   NOT-AVAILABLE
worker-scan2   RUNNING     NOT-AVAILABLE
worker-scan2   AGGREGATING   NOT-AVAILABLE
worker-scan2   DONE          ERROR
3. Check the metrics:
####### The result after the PR is applied:
$ oc run --rm -i --restart=Never --image=registry.fedoraproject.org/fedora-minimal:latest -n openshift-compliance test-metrics -- bash -c 'curl -ks -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://metrics.openshift-compliance.svc:8585/metrics-co' | grep compliance
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "test-metrics" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "test-metrics" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "test-metrics" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "test-metrics" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
# HELP compliance_operator_compliance_scan_error_total A counter for the total number of errors for a particular scan
# TYPE compliance_operator_compliance_scan_error_total counter
compliance_operator_compliance_scan_error_total{name="worker-scan2"} 1
# HELP compliance_operator_compliance_scan_status_total A counter for the total number of updates to the status of a ComplianceScan
# TYPE compliance_operator_compliance_scan_status_total counter
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="AGGREGATING",result="NOT-AVAILABLE"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="DONE",result="ERROR"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="LAUNCHING",result="NOT-AVAILABLE"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="PENDING",result=""} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="RUNNING",result="NOT-AVAILABLE"} 1
######### The result before the PR is applied:
$ oc run --rm -i --restart=Never --image=registry.fedoraproject.org/fedora-minimal:latest -n openshift-compliance test-metrics -- bash -c 'curl -ks -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://metrics.openshift-compliance.svc:8585/metrics-co' | grep compliance
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "test-metrics" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "test-metrics" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "test-metrics" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "test-metrics" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
# HELP compliance_operator_compliance_scan_error_total A counter for the total number of encounters of error
# TYPE compliance_operator_compliance_scan_error_total counter
compliance_operator_compliance_scan_error_total{error="I: oscap: Identified document type: data-stream-collection\nI: oscap: Created a new XCCDF session from a SCAP Source Datastream '/content/ssg-rhcos4-ds.xml'.\nI: oscap: Validating XML signature.\nI: oscap: Signature node not found\nI: oscap: Identified document type: Benchmark\nI: oscap: Identified document type: cpe-list\nI: oscap: Started new OVAL agent ssg-rhcos4-oval.xml.\nI: oscap: Querying system information.\nI: oscap: Starting probe on URI 'queue://system_info'.\nI: oscap: Switching probe to PROBE_OFFLINE_OWN mode.\nI: oscap: I will run system_info_probe_main:\nNo profile matching suffix \"xccdf_org.ssgproject.content_profile_coreos-ncp\" was found. Get available profiles using:\n$ oscap info \"/content/ssg-rhcos4-ds.xml\"\n",name="worker-scan2"} 1
# HELP compliance_operator_compliance_scan_status_total A counter for the total number of updates to the status of a ComplianceScan
# TYPE compliance_operator_compliance_scan_status_total counter
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="AGGREGATING",result="NOT-AVAILABLE"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="DONE",result="ERROR"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="LAUNCHING",result="NOT-AVAILABLE"} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="PENDING",result=""} 1
compliance_operator_compliance_scan_status_total{name="worker-scan2",phase="RUNNING",result="NOT-AVAILABLE"} 1

@xiaojiey (Collaborator) commented Mar 7, 2023

/label qe-approved

@xiaojiey (Collaborator) commented Mar 7, 2023

/jira refresh

@openshift-ci-robot (Collaborator)

@xiaojiey: This pull request references Jira Issue OCPBUGS-1803, which is invalid:

  • expected the bug to target the "4.14.0" version, but it targets "4.13.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xiaojiey (Collaborator) commented Mar 7, 2023

/jira refresh

@openshift-ci-robot (Collaborator)

@xiaojiey: This pull request references Jira Issue OCPBUGS-1803, which is invalid:

  • expected the bug to target only the "4.14.0" version, but multiple target versions were set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xiaojiey (Collaborator) commented Mar 7, 2023

/jira refresh

@openshift-ci-robot (Collaborator)

@xiaojiey: This pull request references Jira Issue OCPBUGS-1803, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @xiaojiey

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rhmdnd (Author) commented Mar 13, 2023

I think we have an issue in some of our cleanup code. I've seen the failure pop up a couple of times, but none of the tests fail directly.

Opened #258 to track a fix.

@rhmdnd (Author) commented Mar 13, 2023

/retest

@rhmdnd (Author) commented Mar 17, 2023

@jhrozek @Vincent056 should be ready for another review from dev.

@jhrozek left a comment

/lgtm

@openshift-ci bot commented Mar 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JAORMX, jhrozek, rhmdnd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rhmdnd (Author) commented Mar 21, 2023

Removing the hold label since this was verified.

@rhmdnd (Author) commented Mar 22, 2023

/retest-required

DNS issues in CI should be resolved now.

@openshift-merge-robot openshift-merge-robot merged commit 478a2ad into ComplianceAsCode:master Mar 22, 2023
@openshift-ci-robot (Collaborator)

@rhmdnd: Jira Issue OCPBUGS-1803: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-1803 has been moved to the MODIFIED state.

In response to this:

…metric

This metric contained the scan error, which can exceed lengths of 2k (sometimes 11k) and causes resource issues with Prometheus and with integrating metrics into different storage backends.

This commit removes the metric since it goes against Prometheus best practices:

https://prometheus.io/docs/practices/naming/#labels

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
