switchover fails on LSN #703

Open
piotrekfus91 opened this issue Jun 18, 2021 · 7 comments

Comments

@piotrekfus91

Hi,
I am trying to do a switchover using repmgr. It stops the primary node correctly, but after that it stalls while waiting for the standby to flush received WAL and then aborts:

postgres@8feb0787ba67:~$ repmgr -f /etc/postgresql/13/main/repmgr.conf standby switchover
NOTICE: executing switchover on node "db2" (ID: 2)
NOTICE: local node "db2" (ID: 2) will be promoted to primary; current primary "db1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "db1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "db1" (ID: 1)
DETAIL: executing server command "/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/4000028
NOTICE: waiting up to 30 seconds (parameter "wal_receive_check_timeout") for received WAL to flush to disk
INFO: sleeping 1 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 2 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 3 of maximum 30 seconds waiting for standby to flush received WAL to disk
[...]
INFO: sleeping 29 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 30 of maximum 30 seconds waiting for standby to flush received WAL to disk
WARNING: local node "db2" is behind shutdown primary "db1"
DETAIL: local node last receive LSN is 0/3D04000, primary shutdown checkpoint LSN is 0/4000028
NOTICE: aborting switchover
HINT: use --always-promote to force promotion of standby
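
For readers comparing the two LSNs in the DETAIL line above: an LSN is printed as two hexadecimal numbers separated by a slash, and the byte position is the first number shifted left by 32 bits plus the second. A minimal Python sketch of that arithmetic follows; the function names are illustrative and not part of repmgr or PostgreSQL.

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN such as '0/4000028' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(primary_lsn, standby_lsn):
    """Bytes of WAL the standby has not yet received (negative if it is ahead)."""
    return parse_lsn(primary_lsn) - parse_lsn(standby_lsn)


if __name__ == "__main__":
    # Values taken from the DETAIL line of the failed switchover.
    gap = lsn_gap_bytes("0/4000028", "0/3D04000")
    print(f"standby is {gap} bytes behind the shutdown checkpoint")
```

For the values in this log the gap is 3129384 bytes, roughly 3 MB of WAL that the standby had not received when the 30 seconds ran out.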

I tried with --force-rewind=/usr/lib/postgresql/13/bin/pg_rewind, but the result is the same. I also created a symlink with sudo ln -s /usr/lib/postgresql/13/bin/pg_rewind /usr/bin/pg_rewind, still to no avail.

repmgr 5.2.0
postgresql 13
ubuntu 20.04 (on docker)
postgresql.override.conf:

listen_addresses = '*'
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'replica'
hot_standby = on
archive_mode = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'
wal_log_hints = on

repmgr.conf:

node_id=2
node_name=db2
conninfo='host=db2 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'
data_directory='/var/lib/postgresql/13/main'
failover=automatic
promote_command='repmgr standby promote -f /etc/postgresql/13/main/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/postgresql/13/main/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command='/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast start'
service_stop_command='/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast stop'
service_restart_command='/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast restart'

Any hints on how to solve this problem?

@sandrobordacchini

Hi,
I have the same issue here with:

  • Debian 10
  • repmgr 5.3.0
  • postgres 12

(no Docker involved, just plain VMs)

Did you work out a solution?
Thanks.

@piotrekfus91

I didn't. After half a year with no answer, we plan to replace repmgr with something else.

@alien11689

We had the same problem with WAL on PostgreSQL 13 and repmgr 5.3.
It happens when the timeline differs between the nodes:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf cluster show
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
INFO: connecting to database
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                          
----+---------+---------+-----------+----------+----------+----------+----------+------------------
 1  | node1   | standby |   running | node2    | default  | 100      | 15       | ...
 2  | node2   | primary | * running |          | default  | 100      | 16       | ...
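
The timeline comparison described above can be scripted. The following is a rough Python sketch that reads the text output of `repmgr cluster show` and reports whether the nodes disagree on the timeline. It assumes the column layout shown in this comment (columns named "Name" and "Timeline"); the helper names are made up for illustration.

```python
SAMPLE = """\
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
INFO: connecting to database
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+------------------
 1  | node1   | standby |   running | node2    | default  | 100      | 15       | ...
 2  | node2   | primary | * running |          | default  | 100      | 16       | ...
"""


def timelines(cluster_show):
    """Map node name -> timeline, parsed from `repmgr cluster show` table text."""
    header = None
    result = {}
    for line in cluster_show.splitlines():
        if "|" not in line:
            continue  # skips NOTICE/INFO lines and the ----+---- separator row
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first table line holds the column names
            continue
        row = dict(zip(header, cells))
        # Ignore rows whose timeline is not a plain number.
        if row.get("Timeline", "").isdigit():
            result[row["Name"]] = int(row["Timeline"])
    return result


def timeline_mismatch(cluster_show):
    """True if the nodes in the table report more than one distinct timeline."""
    return len(set(timelines(cluster_show).values())) > 1


if __name__ == "__main__":
    print(timelines(SAMPLE), "mismatch:", timeline_mismatch(SAMPLE))
```

Treat a mismatch as a hint rather than proof of a problem: in the output at the end of this comment the nodes also show different timelines right after a successful switchover.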

You can restart the standby node:

node1$ sudo systemctl restart postgresql

and the timeline will then be the same on both nodes:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf cluster show
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
INFO: connecting to database
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+------------------
 1  | node1   | standby |   running | node2    | default  | 100      | 16       | ...
 2  | node2   | primary | * running |          | default  | 100      | 16       | ...

The next switchover operation should then succeed:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf standby switchover
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
NOTICE: executing switchover on node "node1" (ID: 1)
INFO: searching for primary node
INFO: checking if node 2 is primary
INFO: current primary node is 2
INFO: SSH connection to host "node2" succeeded
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node1" (ID: 1) will be promoted to primary; current primary "node2" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "node2" (ID: 2)
NOTICE: issuing CHECKPOINT on node "node2" (ID: 2) 
DETAIL: executing server command "sudo /usr/bin/systemctl stop postgresql"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 1/A8000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node1" (ID: 1) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
INFO: standby promoted to primary after 1 second(s)
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node1" (ID: 1) was successfully promoted to primary
INFO: node "node2" (ID: 2) is pingable
INFO: node "node2" (ID: 2) has attached to its upstream node
NOTICE: node "node1" (ID: 1) promoted to primary, node "node2" (ID: 2) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node1" is now primary and node "node2" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

Result:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf cluster show
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
INFO: connecting to database
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                          
----+---------+---------+-----------+----------+----------+----------+----------+-------------------
 1  | node1   | primary | * running |          | default  | 100      | 17       | ...
 2  | node2   | standby |   running | node1    | default  | 100      | 16       | ...

@EamonZhang

@alien11689

I had the same problem, which could be solved by restarting the standby server or waiting a few minutes.

@vyegorov

vyegorov commented Jun 9, 2022

I hit the same issue.

The main cause here is the dummy archive_command; if you disable archiving, things work as expected.

To fix it, set archive_command = '{ sleep 5; true; }'. A shorter delay might work as well.
I am not sure whether this is a repmgr issue or a race inside PostgreSQL, though.
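
To check whether a node has the combination described here (archiving enabled with a command that returns immediately), something like the following Python sketch could be used. It is a deliberately simplified parser of postgresql.conf-style text: it ignores include directives, treats only `on` and `always` as "archiving enabled", and the list of no-op commands is an assumption for illustration.

```python
import re

# Commands that succeed immediately without archiving anything (assumed list).
NOOP_COMMANDS = {"/bin/true", "true", ":"}

# name [=] value, where value is either single-quoted or a bare word.
SETTING_RE = re.compile(r"^\s*([A-Za-z_][\w.]*)\s*=?\s*('(?:[^']|'')*'|[^\s#]+)")


def parse_conf(text):
    """Return {setting: value} for simple 'name = value' lines; later lines win."""
    settings = {}
    for line in text.splitlines():
        match = SETTING_RE.match(line)
        if not match:
            continue  # blank lines and comment lines
        name, value = match.group(1).lower(), match.group(2)
        if value.startswith("'"):
            value = value[1:-1].replace("''", "'")  # unquote, undo '' escapes
        settings[name] = value
    return settings


def has_noop_archiving(text):
    """True if archive_mode is enabled and archive_command is a known no-op."""
    settings = parse_conf(text)
    enabled = settings.get("archive_mode", "off").lower() in {"on", "always"}
    return enabled and settings.get("archive_command", "").strip() in NOOP_COMMANDS


if __name__ == "__main__":
    conf = "archive_mode = on\narchive_command = '/bin/true'\n"
    print(has_noop_archiving(conf))
```

With the settings posted at the top of this issue the check returns True, and it returns False once archive_command is changed to '{ sleep 5; true; }'.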

@fonya

fonya commented Sep 15, 2022

@vyegorov Thank you very much for your answer, that is the solution: archive_command = '{ sleep 5; true; }'

@likingzi

@vyegorov Thank you! Setting archive_command = '{ sleep 5; true; }' as you suggested also solved the same issue for me.


7 participants