`ThanosSidecarBucketOperationsFailed` alert is flaky #7369

cannonpalms · 2024-05-17T18:13:57Z

Alert defined here:

Lines 14 to 26 in e752424

    
    {
      alert: 'ThanosSidecarBucketOperationsFailed',
      annotations: {
        description: 'Thanos Sidecar {{$labels.instance}}%s bucket operations are failing' % location,
        summary: 'Thanos Sidecar bucket operations are failing',
      },
      expr: |||
        sum by (%(dimensions)s) (rate(thanos_objstore_bucket_operation_failures_total{%(selector)s}[5m])) > 0
      ||| % thanos.sidecar,
      'for': '5m',
      labels: {
        severity: 'critical',
      },

Failures happen in distributed systems, yet this alert fires on any nonzero failure rate. Shouldn't we be alerting on the fraction of failed bucket operations rather than the raw rate of failures?

I'd propose rewriting the alert to something like this:

    sum by (%(dimensions)s) (rate(thanos_objstore_bucket_operation_failures_total{%(selector)s}[5m]))
    /
    sum by (%(dimensions)s) (rate(thanos_objstore_bucket_operations_total{%(selector)s}[5m]))
    > THRESHOLD

What should this THRESHOLD be? Should we make this change to the alert?
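For concreteness, here is a rough sketch of how the ratio-based rule might look in the mixin. The `failureRatioThreshold` field is hypothetical and does not exist in the mixin today. The `selector` and `dimensions` values are assumed stand-ins for the real sidecar config, and `0.05` is only a placeholder to make the snippet evaluate, not a proposed answer to the threshold question.

```jsonnet
// Sketch only. `failureRatioThreshold` is a hypothetical field, and the
// selector/dimensions values below are assumptions standing in for the
// mixin's real sidecar config.
local thanos = {
  sidecar: {
    selector: 'job=~".*thanos-sidecar.*"',
    dimensions: 'job, instance',
    failureRatioThreshold: 0.05,  // placeholder value, not a recommendation
  },
};

{
  alert: 'ThanosSidecarBucketOperationsFailed',
  // Fraction of failed bucket operations over the last 5m, per dimension set.
  // The parentheses make the comparison apply to the ratio as a whole.
  expr: |||
    (
      sum by (%(dimensions)s) (rate(thanos_objstore_bucket_operation_failures_total{%(selector)s}[5m]))
    /
      sum by (%(dimensions)s) (rate(thanos_objstore_bucket_operations_total{%(selector)s}[5m]))
    ) > %(failureRatioThreshold)s
  ||| % thanos.sidecar,
  'for': '5m',
  labels: {
    severity: 'critical',
  },
}
```

One property of this form: when a sidecar performs no bucket operations in the window, the division yields NaN, which fails the comparison, so an idle sidecar would not fire the alert.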
