
Failover status in appliance_console UI is incomplete #164

Open
Fryguy opened this issue Jun 24, 2021 · 8 comments

Comments


Fryguy commented Jun 24, 2021

The UI currently shows:

Local Database Server: running (primary)

What it doesn't show, however, is whether standbys are known to the server or not. I propose we change this message to give the user information about the standby servers as well.

Something like:

  • when no standby is available: Local Database Server: running (primary; 0 standby servers available)
  • when a standby is available: Local Database Server: running (primary; 1 standby server available)
  • when this server is the standby: Local Database Server: running (standby; primary server available)

The difficulty here is that the failover monitor runs as a separate service and only refreshes its information at its check frequency. One possibility would be to send a signal (such as SIGUSR1 or SIGINFO) to the service to have it dump the information. Another option is to give the service a user-friendly proctitle that lists a summary of the information.
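For illustration, here is a minimal sketch of the signal and proctitle ideas, assuming the monitor keeps a cached list of servers; the @servers variable, the fetch helper, and the status file path are hypothetical, not the actual monitor internals:

```ruby
# Hypothetical sketch of the signal + proctitle ideas; @servers, the fetch
# helper, and the status file path are placeholders, not real monitor code.
require "json"

Signal.trap("USR1") do
  # Dump the most recently cached server list so appliance_console (or an
  # admin) can read it on demand instead of waiting for the next poll cycle.
  File.write("/run/evm-failover-monitor.status", JSON.pretty_generate(@servers))
end

loop do
  @servers = fetch_servers_from_repmgr_nodes # placeholder for the real check
  standbys = @servers.count { |s| s[:type] == "standby" && s[:active] }

  # A user-friendly proctitle summarizing the cluster, visible via `ps`.
  Process.setproctitle("evm-failover-monitor: #{standbys} active standby(s)")

  sleep 300
end
```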

@jrafanie

Since I won't be able to find where this code is again so quickly... here is where the failover monitor writes the server information, with a 300-second sleep between checks, and where the new servers are stored when we detect differences between repmgr.nodes and our local cache.

@jrafanie

and @servers looks something like:

[{:type=>"primary", :active=>true, :host=>"x.y.z.1", :user=>"root", :dbname=>"vmdb_production"}, {:type=>"standby", :active=>true, :host=>"x.y.z.2", :user=>"root", :dbname=>"vmdb_production"}]
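Given a structure like that, here is a rough sketch of how the proposed status line could be derived (summary_for is a hypothetical helper, not existing appliance_console code):

```ruby
# Hypothetical helper: build the proposed status line from a servers array
# shaped like the dump above.
def summary_for(servers, local_host)
  local    = servers.find { |s| s[:host] == local_host }
  standbys = servers.count { |s| s[:type] == "standby" && s[:active] }

  if local && local[:type] == "primary"
    "Local Database Server: running (primary; #{standbys} standby server#{"s" unless standbys == 1} available)"
  else
    primary_up = servers.any? { |s| s[:type] == "primary" && s[:active] }
    "Local Database Server: running (standby; primary server #{primary_up ? "available" : "unavailable"})"
  end
end

servers = [
  {:type => "primary", :active => true, :host => "x.y.z.1", :user => "root", :dbname => "vmdb_production"},
  {:type => "standby", :active => true, :host => "x.y.z.2", :user => "root", :dbname => "vmdb_production"}
]

puts summary_for(servers, "x.y.z.1")
# => Local Database Server: running (primary; 1 standby server available)
```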

@jrafanie

As a follow-up, when we do add this feature, we should also update the docs to show the proper way to verify that the primary and standby databases are up and active.

https://www.manageiq.org/docs/reference/latest/high_availability_guide/index.html#testing-database-failover


jrafanie commented Oct 6, 2022

We need three types of statuses:

  1. primary role:
  • postgresql must be started
  • repmgr10/13 should not be required
  • any known standbys and their status, via su postgres -c "repmgr cluster show"
  • If we are a promoted standby, we should denote this, since it's hard to tell which node is the primary when you reintroduce a failed node (by starting postgresql again): it starts up and thinks it's the primary, but it isn't.
  2. standby role:
  • postgresql must be started
  • repmgr10/13 needs to be started, verified via systemctl status repmgr10
  • show the upstream (primary) and its status, via su postgres -c "repmgr cluster show"
  • If we are a promoted standby, we should denote this, since it's hard to tell which node is the primary when you reintroduce a failed node (by starting postgresql again): it starts up and thinks it's the primary, but it isn't.
  3. application appliances:
  • these need evm-failover-monitor started, verified via systemctl status evm-failover-monitor
  • We show the primary (ManageIQ database) but should show the standby(s) too. We could then refresh this screen to see when standbys are added or when failover occurs. Note that we only poll for changes to the primary/standby configuration and for a bad primary connection every db_check_frequency seconds (configured in ha_admin.yml), so we won't see failovers or new standbys immediately. A rough sketch of gathering these checks follows below.
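As a rough illustration of how these checks could be gathered (the commands come from the list above, but the method names and the postgresql unit name are assumptions, not appliance_console code):

```ruby
# Sketch only: collect the per-role status checks listed above.
require "open3"

def service_active?(unit)
  # `systemctl is-active --quiet` exits 0 when the unit is running
  system("systemctl", "is-active", "--quiet", unit)
end

def repmgr_cluster_show
  out, _err, status = Open3.capture3("su", "postgres", "-c", "repmgr cluster show")
  status.success? ? out : "repmgr cluster status unavailable"
end

def database_role_status(role)
  case role
  when :primary
    ["postgresql: #{service_active?("postgresql") ? "started" : "stopped"}", # unit name is an assumption
     repmgr_cluster_show]
  when :standby
    ["postgresql: #{service_active?("postgresql") ? "started" : "stopped"}",
     "repmgr10: #{service_active?("repmgr10") ? "started" : "stopped"}",
     repmgr_cluster_show]
  when :application
    ["evm-failover-monitor: #{service_active?("evm-failover-monitor") ? "started" : "stopped"}"]
  end
end

puts database_role_status(:standby)
```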


jrafanie commented Oct 6, 2022

Additionally, we should be really careful about allowing someone to set up a standby on a database that is currently a primary, because the standby configuration causes the database to be reset so it can mirror the primary.
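One possible guard, sketched here with the pg gem; the prompt flow and connection options are assumptions, not existing appliance_console behavior:

```ruby
# Hypothetical guard: before configuring this node as a standby, warn if the
# local database is currently acting as a primary, since standby setup wipes
# the local data directory so it can mirror the new primary.
require "pg"

def local_database_is_primary?(conn_opts = {:dbname => "vmdb_production", :user => "postgres"})
  conn = PG.connect(conn_opts)
  # pg_is_in_recovery() returns false on a primary and true on a standby
  conn.exec("SELECT pg_is_in_recovery()").getvalue(0, 0) == "f"
ensure
  conn&.close
end

if local_database_is_primary?
  puts "WARNING: this database is currently a primary."
  puts "Configuring it as a standby will erase its data directory. Continue? (y/N)"
  abort("Standby configuration cancelled.") unless $stdin.gets.to_s.strip.casecmp?("y")
end
```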

@miq-bot miq-bot added the stale label Feb 27, 2023

miq-bot commented Feb 27, 2023

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

@Fryguy Fryguy removed the stale label Mar 20, 2023
@miq-bot miq-bot added the stale label Jun 26, 2023

miq-bot commented Jun 26, 2023

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

@Fryguy Fryguy removed the stale label Jun 26, 2023

Fryguy commented Jun 26, 2023

@jrafanie Should we move this to help wanted?
