Add alert for high volume level 1 blocks queried #11803


Merged
dimitarvdimitrov merged 10 commits into main from dimitar/mimir/add-level1-blocks-queried-alert on Jun 27, 2025

Conversation

dimitarvdimitrov
Contributor

Summary

  • Add warning alert that fires when level 1 blocks are queried for more than 1 hour
  • Indicates potential compactor performance issues when store-gateway serves non-compacted blocks

Test plan

  • Verify alert compiles correctly in mixin
  • Monitor for false positives in production environments

This alert fires when level 1 blocks are being queried for more than 1 hour,
indicating that the compactor may not be keeping up with compaction work.
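For reference, a rough sketch of how such a rule could look in the mixin, based on the snippet discussed during review below. The severity label and the message annotation are illustrative assumptions rather than the exact merged code, and the %s placeholders stand for the per-deployment matchers and rate interval that the mixin substitutes:

{
  alert: $.alertName('HighVolumeLevel1BlocksQueried'),
  'for': '6h',
  expr: |||
    sum(rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",level="1",%s}[%s])) > 0
  |||,
  labels: {
    severity: 'warning',  // the PR describes this as a warning alert
  },
  annotations: {
    message: 'Store-gateways have been querying level 1 (non-compacted) blocks for more than 6h.',  // illustrative wording
  },
}

During review the expression was refined further: it gained a sum by (%(alert_aggregation_labels)s) aggregation and an out_of_order="false" matcher (see the discussion below).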

@dimitarvdimitrov dimitarvdimitrov marked this pull request as ready for review June 21, 2025 14:18
@dimitarvdimitrov dimitarvdimitrov requested review from tacole02 and a team as code owners June 21, 2025 14:18

@56quarters 56quarters left a comment

LGTM

alert: $.alertName('HighVolumeLevel1BlocksQueried'),
'for': '6h',
expr: |||
sum(rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",level="1",%s}[%s])) > 0
Collaborator

Please sum by %(alert_aggregation_labels)s
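
Applied to the expression above, the suggestion would read roughly as follows (keeping the original %s placeholders; an illustrative sketch, not the merged code):

sum by (%(alert_aggregation_labels)s) (rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",level="1",%s}[%s])) > 0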

@pracucci pracucci left a comment

I backtested this alert at Grafana and it would fire continuously for several cells. I don't think it will be very useful as is.

…book

- Add out_of_order="false" matcher to exclude out-of-order blocks from alert
- Update runbook to explain out-of-order block exclusion
- Add preventative store-gateway scaling guidance to runbook
@dimitarvdimitrov
Contributor Author

I backtested this alert at Grafana and it would fire continuously for several cells. I don't think it will be very useful as is.

This is mostly because of out-of-order blocks. I excluded them from the alert. This is the query I tested with and it looks more normal now. It would have fired during a legitimate case of compactor slowdown between Jun 18 and Jun 20 (I couldn't get to the bottom of it though :( )

count_over_time(
    (sum by(cluster, out_of_order, namespace) (rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",level="1", out_of_order="false", job=~".*/(store-gateway.*|cortex|mimir|mimir-backend.*)"}[5m])) > 0)
    [6h:1m]
) >= 60 * 6
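
Note that the count_over_time over the [6h:1m] subquery emulates the alert's 'for': '6h' clause: the inner expression is evaluated once per minute across the last 6 hours, minutes in which no level 1 blocks were queried produce no sample because of the > 0 filter, and requiring at least 60 * 6 = 360 samples means the condition must have held for essentially the whole window.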

@tacole02 tacole02 left a comment

Thanks for the great docs, @dimitarvdimitrov!

@pracucci pracucci left a comment

I backtested it and, although it would have fired several times, it's definitely way less noisy than the previous query. Let's give it a try. It may spot real issues.


How it **works**:

- Level 1 blocks are deduplicated 2-hour blocks and contain less optimized data compared to higher-level blocks
Collaborator

What do you mean by "deduplicated" here? They contain duplicated data, that's why they're less optimised, no?

Contributor Author

I thought level 1 blocks were just 2-hour blocks which have been through compaction once?

But now I realize that this may not mean that they have been deduplicated. Is it that they've only gone through the split phase of the compactor?

@pracucci

Collaborator

In Prometheus there's no L0. The first block compacted by ingesters is L1. Every step in the compaction after that has a higher level, even the splitting phase, because it's just yet another compaction. Querying level=1 blocks means querying blocks uploaded by ingesters, which I thought was what we wanted here.
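
For reference, the compaction level is recorded in each block's meta.json. A truncated, illustrative example for an ingester-uploaded (level 1) block, with placeholder ULIDs:

{
  "ulid": "01JYXXXXXXXXXXXXXXXXXXXXXX",
  "version": 1,
  "compaction": {
    "level": 1,
    "sources": ["01JYXXXXXXXXXXXXXXXXXXXXXX"]
  }
}

Blocks produced by later compactions, including the split stage, carry a higher level in this field.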

Contributor Author

That's strange, because we see level 0 in some blocks:
[Screenshot showing blocks reported with level 0]

Contributor Author

I think you're right. The level-zero blocks don't have a source and are never marked as out of order. This makes me think that we're only recording a zero-valued block meta, which results in level-zero blocks in the metric. I'll look into why this is. In the meantime, I'll fix the runbook.

Contributor Author

figured it out - see #11891

@dimitarvdimitrov dimitarvdimitrov enabled auto-merge (squash) June 27, 2025 11:23
@dimitarvdimitrov dimitarvdimitrov merged commit 79a8af0 into main Jun 27, 2025
35 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the dimitar/mimir/add-level1-blocks-queried-alert branch June 27, 2025 11:42