switchover fails on LSN #703

piotrekfus91 · 2021-06-18T09:07:01Z

Hi,
I am trying to do a switchover using repmgr. It stops the primary node correctly, but after that it hangs during rewind:

postgres@8feb0787ba67:~$ repmgr -f /etc/postgresql/13/main/repmgr.conf standby switchover
NOTICE: executing switchover on node "db2" (ID: 2)
NOTICE: local node "db2" (ID: 2) will be promoted to primary; current primary "db1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "db1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "db1" (ID: 1)
DETAIL: executing server command "/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/4000028
NOTICE: waiting up to 30 seconds (parameter "wal_receive_check_timeout") for received WAL to flush to disk
INFO: sleeping 1 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 2 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 3 of maximum 30 seconds waiting for standby to flush received WAL to disk
[...]
INFO: sleeping 29 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 30 of maximum 30 seconds waiting for standby to flush received WAL to disk
WARNING: local node "db2" is behind shutdown primary "db1"
DETAIL: local node last receive LSN is 0/3D04000, primary shutdown checkpoint LSN is 0/4000028
NOTICE: aborting switchover
HINT: use --always-promote to force promotion of standby
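To put a number on the gap reported in the DETAIL line above, here is a minimal bash sketch. It assumes the usual PostgreSQL LSN notation (two hexadecimal numbers separated by a slash, the first being the high 32 bits) and the default 16 MB WAL segment size. The helper names lsn_to_bytes and lsn_gap are made up for this illustration and are not part of repmgr or PostgreSQL.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of repmgr or PostgreSQL.

# Convert an LSN written as "hi/lo" (two hex numbers) to an absolute byte position.
lsn_to_bytes() {
    local hi=${1%%/*} lo=${1##*/}
    echo $(( (16#$hi << 32) + 16#$lo ))
}

# Print how many bytes the second LSN is behind the first.
lsn_gap() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Values copied from the log: primary shutdown checkpoint LSN, standby last receive LSN.
lsn_gap 0/4000028 0/3D04000

# Offset of the primary LSN inside its WAL segment (assumption: 16 MB segments).
echo $(( $(lsn_to_bytes 0/4000028) % (16 * 1024 * 1024) ))
```

With the values from the log this prints 3129384 and 40: the standby is roughly 3 MB behind, and under the 16 MB assumption the shutdown checkpoint sits 40 bytes into a new segment. That is only an observation about the numbers, not a diagnosis.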

I tried with --force-rewind=/usr/lib/postgresql/13/bin/pg_rewind, but the result is the same. I also created a symlink (sudo ln -s /usr/lib/postgresql/13/bin/pg_rewind /usr/bin/pg_rewind), but still to no avail.

repmgr 5.2.0
postgresql 13
ubuntu 20.04 (on docker)
postgresql.override.conf:

listen_addresses = '*'
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'replica'
hot_standby = on
archive_mode = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'
wal_log_hints = on

repmgr.conf:

node_id=2
node_name=db2
conninfo='host=db2 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'
data_directory='/var/lib/postgresql/13/main'
failover=automatic
promote_command='repmgr standby promote -f /etc/postgresql/13/main/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/postgresql/13/main/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command='/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast start'
service_stop_command='/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast stop'
service_restart_command='/usr/lib/postgresql/13/bin/pg_ctl  -D /var/lib/postgresql/13/main -W -m fast restart'

Any hints on how to solve this problem?

sandrobordacchini · 2021-12-13T15:51:16Z

Hi,
I have the same issue here with:

Debian 10
repmgr 5.3.0
postgres 12

(no docker involved, just plain vms)

Did you work out a solution?
Thanks.

piotrekfus91 · 2021-12-14T06:34:15Z

I didn't. After half a year with no answer, we plan to replace repmgr with something else.

alien11689 · 2022-03-24T10:43:28Z

We had the same problem with WAL on PostgreSQL 13 and repmgr 5.3.
It happens when the timeline differs between the nodes:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf cluster show
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
INFO: connecting to database
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                          
----+---------+---------+-----------+----------+----------+----------+----------+------------------
 1  | node1   | standby |   running | node2    | default  | 100      | 15       | ...
 2  | node2   | primary | * running |          | default  | 100      | 16       | ...
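The mismatch in the Timeline column above can also be detected by a script before attempting a switchover. This is a rough bash sketch that assumes the table layout shown in this thread (pipe-separated, Timeline as the eighth column, data rows starting with a numeric node ID). The function name timelines_match is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "repmgr cluster show" output on stdin and
# returns 0 only if every node row reports the same timeline.
timelines_match() {
    awk -F'|' '
        # Data rows have a purely numeric first column (the node ID);
        # NOTICE/INFO lines, the header and the separator do not match.
        $1 ~ /^ *[0-9]+ *$/ {
            gsub(/ /, "", $8)      # column 8 is Timeline (layout assumption)
            seen[$8] = 1
        }
        END {
            n = 0
            for (t in seen) n++
            exit (n == 1 ? 0 : 1)  # zero rows or several timelines: failure
        }
    '
}
```

A possible use would be to pipe the output of repmgr -f /etc/postgresql/13/main/repmgr.conf cluster show into timelines_match and only proceed with the switchover when it succeeds.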

You can restart the standby node:

node1$ sudo systemctl restart postgresql

and the timeline will then be equal on both nodes:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf cluster show
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
INFO: connecting to database
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+------------------
 1  | node1   | standby |   running | node2    | default  | 100      | 16       | ...
 2  | node2   | primary | * running |          | default  | 100      | 16       | ...

The next switchover operation should then succeed:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf standby switchover
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
NOTICE: executing switchover on node "node1" (ID: 1)
INFO: searching for primary node
INFO: checking if node 2 is primary
INFO: current primary node is 2
INFO: SSH connection to host "node2" succeeded
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node1" (ID: 1) will be promoted to primary; current primary "node2" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "node2" (ID: 2)
NOTICE: issuing CHECKPOINT on node "node2" (ID: 2) 
DETAIL: executing server command "sudo /usr/bin/systemctl stop postgresql"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 1/A8000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node1" (ID: 1) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
INFO: standby promoted to primary after 1 second(s)
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node1" (ID: 1) was successfully promoted to primary
INFO: node "node2" (ID: 2) is pingable
INFO: node "node2" (ID: 2) has attached to its upstream node
NOTICE: node "node1" (ID: 1) promoted to primary, node "node2" (ID: 2) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node1" is now primary and node "node2" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

Result:

node1$ repmgr -v -f /etc/postgresql/13/main/repmgr.conf cluster show
NOTICE: using provided configuration file "/etc/postgresql/13/main/repmgr.conf"
INFO: connecting to database
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                          
----+---------+---------+-----------+----------+----------+----------+----------+-------------------
 1  | node1   | primary | * running |          | default  | 100      | 17       | ...
 2  | node2   | standby |   running | node1    | default  | 100      | 16       | ...

EamonZhang · 2022-06-07T08:32:40Z

@alien11689

I had the same problem, which could be solved by restarting the standby server or waiting a few minutes.

vyegorov · 2022-06-09T10:30:42Z

I hit the same issue.

The main cause here is the dummy archive_command; if you disable archiving, things work as expected.

To fix it, just set archive_command = '{ sleep 5; true; }'. A smaller delay might work as well.
I am not sure whether this is a repmgr issue or a race inside PostgreSQL, though.
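For anyone who wants to try this workaround without editing the config file by hand, here is a hedged bash sketch. It assumes psql is on the PATH and that you connect as a superuser. It relies on archive_command being changeable with a configuration reload, and on ALTER SYSTEM writing to postgresql.auto.conf, which takes precedence over the other config files. The function name set_archive_command_workaround is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: applies the workaround reported in this thread.
# Any arguments (for example -h db1 -U postgres) are passed through to psql.
set_archive_command_workaround() {
    # Each -c runs as its own statement, so ALTER SYSTEM is not wrapped
    # in a transaction block together with the other commands.
    psql "$@" \
        -c "ALTER SYSTEM SET archive_command = '{ sleep 5; true; }'" \
        -c "SELECT pg_reload_conf()" \
        -c "SHOW archive_command"
}
```

Calling set_archive_command_workaround -U postgres on the primary should end by printing the new value. Note that, like /bin/true, this command still discards WAL instead of archiving it, so it only suits setups that do not rely on a WAL archive.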

fonya · 2022-09-15T20:03:41Z

@vyegorov Thank you very much for your answer; that is the solution: archive_command = '{ sleep 5; true; }'

likingzi · 2022-10-18T02:30:01Z

@vyegorov Thank you! Your reply (setting archive_command = '{ sleep 5; true; }') also solved the same issue for me.
