
[Feature] Alerts for the compaction job metrics #603

Open
abdasgupta opened this issue May 26, 2023 · 4 comments · May be fixed by gardener/gardener#9739
Assignees
Labels
kind/enhancement Enhancement, improvement, extension

Comments

@abdasgupta
Contributor

abdasgupta commented May 26, 2023

Feature (What you would like to be added):
We are exposing metrics for the compaction job in etcd-druid via #569 . Now we need to whitelist these metrics in g/g and raise the correct alerts based on them. We have decided to raise alerts when `job_duration_seconds` and `jobs_total` with the `failed` label cross certain thresholds.
Motivation (Why is this needed?):

Approach/Hint to the implement solution (optional):
- Threshold for `jobs_total` with the `failed` label in a seed: 10% of the aggregated `jobs_total` (this alert would be raised per seed)
- Threshold for `jobs_total` with the `failed` label in a shoot: 10% of `jobs_total` (this alert would be raised per shoot)

To raise the alert at the seed level, the aggregate Prometheus can be used. The `jobs_total` metric with the `failed` label is scraped from etcd-druid by the cache Prometheus. The aggregate Prometheus can aggregate `jobs_total` from the cache Prometheus, and the seed-level alert can be raised on this aggregated metric in the aggregate Prometheus.
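
A minimal sketch of what such a seed-level rule could look like in the aggregate Prometheus, assuming the metric is federated as `etcddruid_compaction_jobs_total` with failed jobs labelled `succeeded="false"` (the actual metric name and labels depend on what #569 exposes):

```yaml
# Sketch of a seed-level alerting rule for the aggregate Prometheus.
# Metric name and labels are assumptions; adjust to the real series.
groups:
  - name: compaction-jobs.rules
    rules:
      - alert: CompactionJobsFailingInSeed
        # Ratio of failed compaction jobs to all compaction jobs in the
        # seed over the last hour, alerting when it exceeds 10%.
        expr: |
          sum(rate(etcddruid_compaction_jobs_total{succeeded="false"}[1h]))
            /
          sum(rate(etcddruid_compaction_jobs_total[1h]))
            > 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: More than 10% of the compaction jobs in this seed are failing.
```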

To raise the alert at the shoot level, alerts can be raised in the control plane Prometheus. The control plane Prometheus already federates the shoot-specific `jobs_total` from the cache Prometheus. So, to raise the alert for `jobs_total` at the shoot level, we need to add an alert for `jobs_total` here.
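
At the shoot level the rule shape would be essentially the same, just evaluated in the control plane Prometheus; since that Prometheus only federates the metrics of its own shoot, no extra grouping is needed. A sketch under the same metric-name assumption as above, slotting into a rule group like the one shown before:

```yaml
# Sketch: the same failure-ratio expression, evaluated per shoot in the
# control plane Prometheus (metric name and labels are assumptions).
- alert: CompactionJobsFailingInShoot
  expr: |
    sum(rate(etcddruid_compaction_jobs_total{succeeded="false"}[1h]))
      /
    sum(rate(etcddruid_compaction_jobs_total[1h]))
      > 0.10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: More than 10% of the compaction jobs for this shoot are failing.
```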

Another idea is to aggregate the alert data that is already raised by the shoot control plane Prometheus. Alerts from the shoot control plane Prometheus are passed to the aggregate Prometheus in the garden namespace; only the alert data is passed. We can aggregate this alert data streaming from multiple shoots into the aggregate Prometheus and raise the alert for `jobs_total` at the seed level.
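
One way this could look in practice, assuming the firing alerts of the control plane Prometheus instances are federated into the aggregate Prometheus via the synthetic `ALERTS` series (the alert name and the threshold of 10 shoots are purely illustrative):

```yaml
# Sketch: seed-level alert derived from federated shoot-level alerts.
# Assumes ALERTS from the control plane Prometheus instances is available
# in the aggregate Prometheus; names and threshold are illustrative.
- alert: CompactionJobsFailingAcrossShoots
  expr: |
    count(
      ALERTS{alertname="CompactionJobsFailingInShoot", alertstate="firing"}
    ) > 10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Compaction jobs are failing in a significant number of shoots in this seed.
```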

@DelinaDeng

We received many alerts in the gardener AC canary landscape saying "Pod 60f2a9-compact-job-cb9tv is not ready for more than 30 minutes".

Is it related to this issue?

@abdasgupta
Contributor Author

@DelinaDeng It shouldn't be related to this issue. The alerts you mentioned seem to be caused by pods not running their containers at all, due to resource unavailability or a scheduling problem. We can check further if you give us the cluster details.

@abdasgupta
Contributor Author

abdasgupta commented Oct 16, 2023

As per an out-of-band discussion with @shreyas-s-rao, we decided to use `jobs_total` with the `failed` label as the metric to raise an alert on when it crosses a threshold, both for shoots and for seeds.

@renormalize
Member

renormalize commented Apr 30, 2024

After more discussion, it was decided that raising an alert for every compaction job failure would generate a very large number of alerts, since a job can fail for a multitude of reasons.

To get a more holistic understanding of the health of the shoots in a seed, alerts are to be raised at the seed level when more than X% (for example, 10%) of the compaction jobs deployed in the seed fail.
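
A possible shape for this, sketched under the same assumption about the metric name as in the earlier comments, is a recording rule for the seed-wide failure ratio plus an alert on it, with X% set to 10% here:

```yaml
# Sketch: seed-wide failure ratio over the last day, recorded and alerted on.
# Metric name, labels, window, and the 10% threshold are assumptions.
groups:
  - name: compaction-jobs-seed.rules
    rules:
      - record: seed:compaction_job_failure_ratio
        expr: |
          sum(increase(etcddruid_compaction_jobs_total{succeeded="false"}[1d]))
            /
          sum(increase(etcddruid_compaction_jobs_total[1d]))
      - alert: CompactionJobHighFailureRatio
        expr: seed:compaction_job_failure_ratio > 0.10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: More than 10% of the compaction jobs deployed in this seed have failed over the last day.
```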
