Ruler evaluation warning false alarm caused by engine warnings #7354

Open
yeya24 opened this issue May 13, 2024 · 1 comment

Comments


yeya24 commented May 13, 2024

Problem

Thanos uses query response warnings to propagate partial response information from the Store APIs.

https://thanos.io/tip/components/rule.md/#must-have-essential-ruler-alerts recommends setting an alert on the thanos_rule_evaluation_with_warnings_total metric, and the Thanos mixin includes this alert as well.

thanos_rule_evaluation_with_warnings_total. If you choose to use Rules and Alerts with [partial response strategy’s](https://thanos.io/tip/components/rule.md/#partial-response) value as “warn”, this metric will tell you how many evaluation ended up with some kind of warning. To see the actual warnings see WARN log level. This might suggest that those evaluations return partial response and might not be accurate.

However, this metric has been broken since Prometheus started propagating warnings from the engine (prometheus/prometheus#12152). For example, a metric name that doesn't end with _total will result in a warning, which causes thanos_rule_evaluation_with_warnings_total to increase and triggers the alert.

Proposal

  • For thanos_rule_evaluation_with_warnings_total, count only warnings that come from partial responses, or
  • Remove this alert from the Thanos mixin and update the docs.

yeya24 commented May 20, 2024

Any ideas on how to fix this issue? My current thinking is to move the partial response warning metric to Thanos Querier and remove it from Ruler.
Thanos Querier can detect whether a warning comes from the storage layer or from the engine, so it can emit the correct metric for partial responses only.
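
A rough sketch of that idea, assuming the Querier can tag each warning with its origin: the `origin` and `warning` types and the `hasPartialResponse` helper below are hypothetical names chosen for illustration, not existing Thanos or Prometheus APIs.

```go
package main

import "fmt"

// origin records which layer produced a warning (hypothetical type).
type origin int

const (
	originStorage origin = iota // warning raised by the storage layer (partial response)
	originEngine                // warning raised by the query engine
)

// warning is a hypothetical warning tagged with the layer it came from.
type warning struct {
	Origin origin
	Msg    string
}

// hasPartialResponse reports whether at least one warning came from the
// storage layer; engine warnings are ignored.
func hasPartialResponse(ws []warning) bool {
	for _, w := range ws {
		if w.Origin == originStorage {
			return true
		}
	}
	return false
}

func main() {
	engineOnly := []warning{{Origin: originEngine, Msg: "metric name does not end with _total"}}
	mixed := append([]warning{{Origin: originStorage, Msg: "store did not respond"}}, engineOnly...)

	partialResponses := 0
	for _, ws := range [][]warning{engineOnly, mixed, nil} {
		if hasPartialResponse(ws) {
			partialResponses++ // only this would feed the partial response metric
		}
	}
	fmt.Println(partialResponses) // prints 1
}
```

With a check like this, an engine-only warning would no longer increment the partial response metric, so the alert would fire only when a store actually returned incomplete data.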
