Add alert for high volume level 1 blocks queried #11803
Conversation
This alert fires when level 1 blocks are being queried for more than 1 hour, indicating that the compactor may not be keeping up with compaction work.
LGTM
```jsonnet
alert: $.alertName('HighVolumeLevel1BlocksQueried'),
'for': '6h',
expr: |||
  sum(rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",level="1",%s}[%s])) > 0
```
Please sum by %(alert_aggregation_labels)s
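For illustration, a sketch of what the expression might look like with that aggregation applied (assuming the mixin's usual `%(alert_aggregation_labels)s` string-template substitution; the exact formatting call is not shown in this diff):

```promql
sum by (%(alert_aggregation_labels)s) (
  rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",level="1",%s}[%s])
) > 0
```

Grouping by the aggregation labels makes the alert fire per cell/cluster rather than once globally, so the notification identifies where the compactor is falling behind.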
I backtested this alert at Grafana and it would fire continuously for several cells. I don't think it will be very useful as is.
…book

- Add out_of_order="false" matcher to exclude out-of-order blocks from alert
- Update runbook to explain out-of-order block exclusion
- Add preventative store-gateway scaling guidance to runbook
This is mostly because of out-of-order blocks. I excluded them from the alert. This is the query I tested with and it looks more normal now. It would have fired during a legitimate case of compactor slowdown between Jun 18 and Jun 20 (I couldn't get to the bottom of it though :( )
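Based on the commit description above, the tested expression presumably adds the `out_of_order="false"` matcher inline (a sketch; the exact query screenshot is not reproduced here):

```promql
sum(rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",level="1",out_of_order="false",%s}[%s])) > 0
```

Out-of-order blocks legitimately remain at level 1 for longer, so excluding them keeps the alert focused on cases where the compactor is genuinely behind on ingester-uploaded blocks.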
Thanks for the great docs, @dimitarvdimitrov !
I backtested it and, although it would have fired several times, it's definitely way less noisy than the previous query. Let's give it a try. It may spot real issues.
How it **works**:

- Level 1 blocks are deduplicated 2-hour blocks and contain less optimized data compared to higher-level blocks
What do you mean by "deduplicated" here? They contain duplicated data, that's why they're less optimised, no?
I thought level 1 blocks are just 2-hour blocks which have been through compaction once?
But now I realize that this may not mean that they have been deduplicated. Is it that they've only gone through the split phase of the compactor?
In Prometheus there's no L0. The first block compacted by ingesters is L1. Every step in the compaction after that has a higher level, even the splitting phase, because it's just yet another compaction. Querying level=1 blocks means querying blocks uploaded by ingesters, which I thought was what we wanted here.
I think you're right. Level zero blocks don't have a source and are never out-of-order. This makes me think that we're only recording a zero-valued block meta, and this is resulting in level zero blocks in the metric. I'll look into why this is. In the meantime, I'll fix the runbook.
figured it out - see #11891