Prio 1 = customer impact, need immediate action
Prio 2 = loss of internal redundancy, need quick action to avoid subsequent Prio 1 on another failure
Prio 3 = a certain Prio 1 situation is upcoming, need fast action to avoid it
Prio 4 = a potential Prio 1 or certain Prio 2 situation is upcoming, need fast action to avoid it
In my mind, Prio 1-2 should page, while Prio 3-4 should raise daytime alerts.
Prio 3 or higher*: As a Cloud Operator, I want to know if multiple replicas of an HA L3 router think they are in master state, because that indicates a potentially customer-visible network issue (ARP fight or L2 loss between two nodes).
Prio 3 or higher*: As a Cloud Operator, I want to know if an HA L3 router has no replica in master state, as that renders the router dysfunctional. This has customer-visible impact: traffic does not reach the instances, because the upstream router cannot find a MAC address to send it to.
Prio 4 or higher*: As a Cloud Operator, I want to know if the number of HA L3 router replicas stays below the configured number for an extended period.
(*): both of these can happen temporarily in a healthy system, due to monitoring races but also due to propagation delays of state information, so alerting correctly is a bit tricky; hence just Prio 3.
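To make the footnote concrete, here is a minimal sketch of how the three checks could be debounced so that a short-lived monitoring race does not fire an alert. Everything in it is an assumption for illustration: the state strings `"master"` and `"backup"`, the `RouterHAChecker` class, the condition names, and the default threshold of three consecutive samples are hypothetical and not taken from any existing exporter or from Neutron itself. The priorities are the minimum values from the user stories above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional

# Minimum priority per condition, taken from the user stories above.
PRIORITY = {"multiple_masters": 3, "no_master": 3, "degraded_redundancy": 4}


@dataclass
class RouterHAChecker:
    """Hypothetical debouncing checker; not an existing API."""

    expected_replicas: int
    threshold: int = 3  # assumed: consecutive bad samples before alerting
    _history: Dict[str, Deque[Optional[str]]] = field(
        default_factory=dict, init=False
    )

    @staticmethod
    def classify(states: List[str], expected_replicas: int) -> Optional[str]:
        """Map one sample of replica states to a condition, or None if healthy."""
        masters = sum(1 for s in states if s == "master")
        if masters > 1:
            return "multiple_masters"
        if masters == 0:
            return "no_master"
        if len(states) < expected_replicas:
            return "degraded_redundancy"
        return None

    def observe(self, router_id: str, states: List[str]) -> Optional[str]:
        """Record a sample; return a condition only once it has persisted."""
        condition = self.classify(states, self.expected_replicas)
        history = self._history.setdefault(
            router_id, deque(maxlen=self.threshold)
        )
        history.append(condition)
        persisted = (
            condition is not None
            and len(history) == self.threshold
            and all(c == condition for c in history)
        )
        return condition if persisted else None
```

A caller would feed `observe()` one sample per polling interval and look up `PRIORITY[condition]` when it returns a non-`None` value. A single healthy sample in between resets the streak, which is what keeps propagation delays from turning into alerts; a longer threshold for `degraded_redundancy` would be one way to express "for an extended period".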