
[Feature] Alerts for the compaction job metrics #603

Open
abdasgupta opened this issue May 26, 2023 · 4 comments · May be fixed by gardener/gardener#9739
Assignees
Labels
kind/enhancement Enhancement, improvement, extension

Comments

@abdasgupta
Contributor

abdasgupta commented May 26, 2023

Feature (What you would like to be added):
We are exposing metrics for the compaction job in etcd-druid via #569 . Now we need to whitelist these metrics in g/g and raise the correct alerts based on them. We have decided to raise alerts when `job_duration_seconds` and `jobs_total` with the `failed` label cross certain thresholds.
Motivation (Why is this needed?):

Approach/Hint to the implement solution (optional):
- Threshold for `jobs_total` with the `failed` label in a seed: 10% of the aggregated `jobs_total` (this alert would be raised per seed)
- Threshold for `jobs_total` with the `failed` label in a shoot: 10% of `jobs_total` (this alert would be raised per shoot)

To raise the alert at the seed level, the aggregate Prometheus can be used. The `jobs_total` metric with the `failed` label is scraped from etcd-druid by the cache Prometheus. The aggregate Prometheus can aggregate `jobs_total` from the cache Prometheus, and the seed-level alert can be raised on this aggregated metric in the aggregate Prometheus.
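
A minimal sketch of what such a seed-level rule could look like in the aggregate Prometheus, assuming the metric is federated as `etcddruid_compaction_jobs_total` with failed jobs labelled `succeeded="false"` (the actual metric name and labels depend on what #569 exposes):

```yaml
# Sketch of a seed-level alerting rule for the aggregate Prometheus.
# Metric name and labels are assumptions; adjust to the real series.
groups:
  - name: compaction-jobs.rules
    rules:
      - alert: CompactionJobsFailingInSeed
        # Ratio of failed compaction jobs to all compaction jobs in the
        # seed over the last hour, alerting when it exceeds 10%.
        expr: |
          sum(rate(etcddruid_compaction_jobs_total{succeeded="false"}[1h]))
            /
          sum(rate(etcddruid_compaction_jobs_total[1h]))
            > 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: More than 10% of the compaction jobs in this seed are failing.
```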

To raise the alert at the shoot level, alerts can be raised in the control plane Prometheus. The control plane Prometheus already federates the shoot-specific `jobs_total` from the cache Prometheus. So, to raise the alert for `jobs_total` at the shoot level, we need to add an alert for `jobs_total` here.
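
At the shoot level the rule shape would be essentially the same, just evaluated in the control plane Prometheus; since that Prometheus only federates the metrics of its own shoot, no extra grouping is needed. A sketch under the same metric-name assumption as above, slotting into a rule group like the one shown before:

```yaml
# Sketch: the same failure-ratio expression, evaluated per shoot in the
# control plane Prometheus (metric name and labels are assumptions).
- alert: CompactionJobsFailingInShoot
  expr: |
    sum(rate(etcddruid_compaction_jobs_total{succeeded="false"}[1h]))
      /
    sum(rate(etcddruid_compaction_jobs_total[1h]))
      > 0.10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: More than 10% of the compaction jobs for this shoot are failing.
```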

Another idea is to aggregate the alert data that is already raised by the shoot control plane Prometheus. Alerts from the shoot control plane Prometheus are passed to the aggregate Prometheus in the garden namespace; only the alert data is passed. We can aggregate this alert data streaming from multiple shoots into the aggregate Prometheus and raise the alert for `jobs_total` at the seed level.
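
One way this could look in practice, assuming the firing alerts of the control plane Prometheus instances are federated into the aggregate Prometheus via the synthetic `ALERTS` series (the alert name and the threshold of 10 shoots are purely illustrative):

```yaml
# Sketch: seed-level alert derived from federated shoot-level alerts.
# Assumes ALERTS from the control plane Prometheus instances is available
# in the aggregate Prometheus; names and threshold are illustrative.
- alert: CompactionJobsFailingAcrossShoots
  expr: |
    count(
      ALERTS{alertname="CompactionJobsFailingInShoot", alertstate="firing"}
    ) > 10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Compaction jobs are failing in a significant number of shoots in this seed.
```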

@DelinaDeng

We received many alerts in the gardener AC canary landscape saying "Pod 60f2a9-compact-job-cb9tv is not ready for more than 30 minutes".

Is it related to this issue?

@abdasgupta
Contributor Author

@DelinaDeng It shouldn't be related to this issue. The alerts you mentioned seem to be caused by pods not running their containers at all, due to resource unavailability or a scheduling problem. We can check further if you give us the cluster details.

@abdasgupta
Contributor Author

abdasgupta commented Oct 16, 2023

As per an out-of-band discussion with @shreyas-s-rao, we decided to use `jobs_total` with the `failed` label as the metric to raise an alert on when it crosses a threshold, both for shoots and for seeds.

@renormalize
Member

renormalize commented Apr 30, 2024

After more discussion, it was decided that raising an alert for every compaction job failure would generate a very large number of alerts, since a job can fail for a multitude of reasons.

To get a more holistic understanding of the health of the shoots in a seed, alerts are to be raised at the seed level when more than X% (for example, 10%) of the compaction jobs deployed in the seed fail.
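
A possible shape for this, sketched under the same assumption about the metric name as in the earlier comments, is a recording rule for the seed-wide failure ratio plus an alert on it, with X% set to 10% here:

```yaml
# Sketch: seed-wide failure ratio over the last day, recorded and alerted on.
# Metric name, labels, window, and the 10% threshold are assumptions.
groups:
  - name: compaction-jobs-seed.rules
    rules:
      - record: seed:compaction_job_failure_ratio
        expr: |
          sum(increase(etcddruid_compaction_jobs_total{succeeded="false"}[1d]))
            /
          sum(increase(etcddruid_compaction_jobs_total[1d]))
      - alert: CompactionJobHighFailureRatio
        expr: seed:compaction_job_failure_ratio > 0.10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: More than 10% of the compaction jobs deployed in this seed have failed over the last day.
```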
