L3 router replication status #98

Open
horazont opened this issue Jun 18, 2021 · 2 comments

Comments

@horazont
Member

Prio 1 = customer impact, need immediate action
Prio 2 = loss of internal redundancy, need quick action to avoid subsequent Prio 1 on another failure
Prio 3 = certain Prio 1 situation is upcoming, need fast action to avoid
Prio 4 = potential Prio 1 or certain Prio 2 situation is upcoming, need fast action to avoid

In my mind, Prio 1-2 should page, while Prio 3-4 should only raise daytime alerts.

  • Prio 3 or higher*: As a Cloud Operator, I want to know if multiple replicas of an HA L3 router think they are in master state, because that indicates a potentially customer-visible network issue (ARP fight or L2 loss between two nodes).
  • Prio 3 or higher*: As a Cloud Operator, I want to know if an HA L3 router has no replica in master state, because that renders the router dysfunctional with customer-visible impact: traffic does not reach the instances, since the upstream router cannot find the MAC address to send it to.
  • Prio 4 or higher*: As a Cloud Operator, I want to know if the number of HA L3 router replicas stays below the configured number for an extended period.

(*): These conditions can occur temporarily in a healthy system, due to monitoring races as well as propagation delays of state information, so alerting on them correctly is a bit tricky; hence just Prio 3.
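
To make the three checks above concrete, here is a minimal sketch in Python. It is not an existing implementation: the function names, the finding labels, and the state strings `"master"` and `"backup"` are assumptions for illustration, and collecting the per-replica states is left out entirely. The `persistent` helper is one possible way to handle the transient cases from the footnote, by only reporting a finding once it has been seen in several consecutive polls.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def classify_router(ha_states: Iterable[str], expected_replicas: int) -> List[str]:
    """Return the findings for one HA router, given the state of each replica.

    The state strings ("master", "backup") are assumed; adapt them to
    whatever the data source actually reports.
    """
    counts = Counter(ha_states)
    total = sum(counts.values())
    masters = counts["master"]

    findings = []
    if masters > 1:
        findings.append("multiple-masters")   # possible ARP fight / L2 loss
    if masters == 0:
        findings.append("no-master")          # router is dysfunctional
    if total < expected_replicas:
        findings.append("under-replicated")   # fewer replicas than configured
    return findings


def persistent(history: Sequence[List[str]], finding: str, min_polls: int) -> bool:
    """True if `finding` appeared in each of the last `min_polls` polls.

    This filters out short-lived states caused by monitoring races or
    propagation delays.
    """
    recent = history[-min_polls:]
    return len(recent) == min_polls and all(finding in poll for poll in recent)
```

For example, `classify_router(["master", "master", "backup"], 3)` yields `["multiple-masters"]`, and an alert would only fire if `persistent(...)` confirms that finding across the chosen number of polls. How many polls count as "a longer time" is a tuning decision this sketch does not answer.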

This was referenced Aug 12, 2022
@JohnGarbutt
Contributor

The other issue I have seen is the router jumping around between nodes; it’s worth looking for that case as well.
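
One way to look for this "jumping around" case is to count how often the node holding the master role changes over a window of observations. The sketch below assumes a hypothetical list of sampled master hostnames, with `None` for polls where no master was seen; the function names and the threshold are illustrative only.

```python
from typing import Optional, Sequence


def count_master_changes(master_hosts: Sequence[Optional[str]]) -> int:
    """Count how often the master moved to a different node.

    `master_hosts` holds one sampled hostname per poll; None means no
    master was observed in that poll and is skipped.
    """
    changes = 0
    previous = None
    for host in master_hosts:
        if host is None:
            continue
        if previous is not None and host != previous:
            changes += 1
        previous = host
    return changes


def is_flapping(master_hosts: Sequence[Optional[str]], max_changes: int = 2) -> bool:
    """True if the master moved more than `max_changes` times in the window."""
    return count_master_changes(master_hosts) > max_changes
```

A single failover within the window stays below the threshold, while repeated moves back and forth exceed it.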

@berendt
Member

berendt commented Apr 21, 2022

Perhaps https://github.com/osism/openstack-router-status can be reused in this context.

@itrich itrich transferred this issue from another repository Aug 12, 2022