  description: 'Thanos Sidecar {{$labels.instance}}%s bucket operations are failing' % location,
  summary: 'Thanos Sidecar bucket operations are failing',
},
expr: |||
  sum by (%(dimensions)s) (rate(thanos_objstore_bucket_operation_failures_total{%(selector)s}[5m])) > 0
||| % thanos.sidecar,
'for': '5m',
labels: {
  severity: 'critical',
},
Failures happen in distributed systems. Shouldn't we be alerting on the percentage of failed bucket operations rather than the simple rate of failures?
I'd propose rewriting the alert to something like this:
sum by (%(dimensions)s) (rate(thanos_objstore_bucket_operation_failures_total{%(selector)s}[5m]))
/
sum by (%(dimensions)s) (rate(thanos_objstore_bucket_operations_total{%(selector)s}[5m]))
> THRESHOLD
What should this THRESHOLD be? Should we make this change to the alert?
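To make the proposal concrete, here is a minimal sketch of how the ratio-based expression might look in Jsonnet. The `sidecar` object, its `selector` and `dimensions` values, and the `bucketOpFailureThreshold` field are all hypothetical stand-ins for whatever the mixin actually provides, and `0.05` is only a placeholder, not a suggested answer to the THRESHOLD question.

```jsonnet
// Hypothetical stand-in for the mixin's sidecar config; values are for illustration only.
local sidecar = {
  selector: 'job=~".*thanos-sidecar.*"',
  dimensions: 'job, instance',
  // Assumed placeholder: fire when more than 5% of bucket operations fail.
  bucketOpFailureThreshold: 0.05,
};

{
  // Ratio of failed bucket operations to all bucket operations over 5 minutes.
  expr: |||
    (
      sum by (%(dimensions)s) (rate(thanos_objstore_bucket_operation_failures_total{%(selector)s}[5m]))
    /
      sum by (%(dimensions)s) (rate(thanos_objstore_bucket_operations_total{%(selector)s}[5m]))
    ) > %(bucketOpFailureThreshold)s
  ||| % sidecar,
  'for': '5m',
  labels: {
    severity: 'critical',
  },
}
```

Keeping the threshold as a field next to the selector would let each deployment override it instead of hard-coding a single value in the expression.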
Alert defined here: thanos/mixin/alerts/sidecar.libsonnet, lines 14 to 26 in e752424